Abstract



An important aspect of data integration involves answering queries using various resources 
rather than by accessing database relations. The process of transforming a query from the 
database relations to the resources is often referred to as query folding or answering queries using 
views, where the views are the resources. We present a uniform approach that includes as special
cases much of the previous work on this subject. Our approach is logic-based using resolution.
We also deal with integrity constraints, negation, and recursion within this framework.



1 Introduction

An important part of data integration involves answering queries using various resources rather than 
by accessing database relations. The process of transforming a query from the database relations 
to the resources is often referred to as query folding or as answering queries using views, where the
views are the resources. For instance, a database of interest to a user may be distributed over a
network. It is necessary to bring data distributed over a network to a user's machine so that the 
data may be manipulated to answer user queries. In a distributed environment it is likely that one 
will want to save answers to queries in the local user's machine so that if the same or a related 
query is posed to the distributed database, one can look in the local machine's cached database for 
the answer, rather than have to go out over the network to answer the query. In this situation the 
resources are the cached relations and the use of these resources is an important aspect of query 
optimization. In some data integration systems the database relations are themselves virtual and 
the data must be obtained from the resources. Resources may also be materialized views. 

Several researchers have considered various aspects of this problem. In this paper we present a logic-based
approach to the query folding problem using the method of resolution. As a consequence:

1. We obtain a uniform approach that includes as special cases much of the previous work on
this topic.



2. If the algorithm finds a rewriting of the query, then we are guaranteed that the answers are 
sound, that is, the answers are correct. 

3. The approach also allows us to determine under certain conditions if the folded query contains 
all answers to the original query, that is, it is complete. 

We consider a database that consists of an extensional database (EDB), an intensional database 
(IDB), a set of integrity constraints (ICs), and a set of resources (ResDB), where the resources have
been obtained by using resource rules. These resources are referred to as materialized views. That 
is, they have been made explicit in a local computer as a result, for example, of an answer to a 
conjunctive query. The EDB, IDB, ICs are part of a conventional Datalog database. 

Section 2 describes related work, complexity results, systems and algorithms that have been devel- 
oped with respect to the folding problem. Section 3 provides background for the definitions and 
notations used in the paper. Section 4 contains several examples and our query folding algorithm. 
The algorithm is logic-based and deals uniformly with integrity constraints, extensional and in- 
tensional predicates and extends the work in [Qia96]. Functional and inclusion dependencies are 
considered in Section 5. We show that our query folding algorithm handles all integrity constraints 
in a uniform way without the need for specialized techniques as in [DGQ96, Gry98] and [DPT99, 
PDST00], where the latter also consider physical access structures, which we do not treat. Section
6 deals with the case where resources are obtained by the use of several definitions or queries.
Our work on multiple rules for the same resource relates to the work in [Dus97b, AGK98, FG01].
Whereas they are concerned primarily with maximal containment, we are concerned with a uniform
method that checks both for soundness and completeness of answers. Section 7 discusses negation.
We show that the logic-based framework handles stratified negation in both rules and intensional
predicates. Our approach is different from the rewriting used in [FG01]. Recursion is considered in
Section 8. We differ from the work in [DG97a] and [DGL00] by handling a special case of recursive
views as well as recursive queries. We compare the contributions made in this paper with other 
efforts in Section 8. The paper is summarized in Section 9. 

2 Related Work 

There has been a substantial amount of work done in connection with data integration and query 
folding. Apparently the first papers to propose algorithms for query folding were in [LY85, YL87], 
where they developed an algebraic method. Other early work was by [TSI94] and by [CKPS95]. An 
algorithm for rewriting conjunctive queries over non-recursive databases was provided in [Qia96]. 
In [DGQ96, Gry98] it is shown how to use materialized views in the presence of functional and 
inclusion dependencies. Additional work on conjunctive query optimization and on information 
integration appears in [Ull97, LRO96a, LRO96b, DG97a]. Levy et al. [LMSS95] showed that the
question of determining whether a conjunctive query can be rewritten to an equivalent conjunctive
query that only uses views is NP-complete. This work was extended in [RSU95] to include binding
patterns in view definitions. Duschka [Dus97a] discusses the concept of local completeness, where
it is known that some subset of the data an information source stores is complete, although the 
entire data stored by the information source might not be complete. [DG97a] were the first to 
extend the work to general recursive queries. They show that the problem of whether a Datalog
program can be rewritten into an equivalent program that only uses views is undecidable. [DGL00]
also discuss answering recursive queries using views for Datalog programs. Duschka and
Levy [DL97] introduce the new class of recursive query plans for information gathering. Plans are 
extended to be recursive sets of function-free Horn clauses. In his thesis, Duschka [Dus97b] deals 
with multiple definitions of the same resource and shows how to obtain a maximally contained query 
using these definitions. Duschka and Genesereth [DG98] were the first to publish results concerning
how to handle multiple definitions of the same resource predicate. Afrati et al. [AGK98] extend
this work to disjunctive queries and related results. Popa and his co-authors [DPT99, PDST00]
showed how to do query folding with some forms of integrity constraints using the so-called 'chase
method'. Flesca and Greco [FG01] deal with disjunction and negation and formulate their answers
in terms of classical and default negation. The complexity of answering queries using materialized
views for conjunctive queries with inequality, positive queries, Datalog and first-order logic is
addressed in [AD98]. The paper [Lev01] surveys the methods proposed for answering queries using
views. See Ullman [Ull97] for a survey of work concerning information-integration tools to answer
queries using views that represent the capabilities of information resources. The formal basis of
techniques related to containment algorithms for conjunctive queries and/or Datalog programs
is discussed there. Approaches taken by AT&T Labs's Information Manifold and the Stanford
TSIMMIS project [GMPQ+95] are compared. Levy in [Lev00] describes several algorithms proposed
for data integration, including the bucket and inverse-rules algorithms.

Several papers address complexity problems associated with folding. Chandra and Merlin [CM77]
showed that the query containment problem is NP-complete. Several subclasses of conjunctive 
queries were identified that have polynomial-time containment algorithms [ASU79a, ASU79b, JK83]. 

The query folding problem is thus at least NP-hard. It has been shown to be NP-complete for 
conjunctive queries and resources in [LMSS95]. Many variants of the problem of answering queries 
using views are discussed in [Lev01]. The problem was shown to be NP-complete even when queries
describing the sources and the user query are conjunctive and do not contain interpreted predicates 
([LMSS95]). [LMSS95] further show that in the case of conjunctive queries, the candidate rewritings 
can be limited to those that have at most the number of subgoals in the query. The complexity of 
the problem is polynomial in the number of views (i.e., the number of data sources in the context 
of data integration). Since query containment is a special case of query folding, Qian's algorithm
[Qia96] degenerates to a polynomial-time containment algorithm for the class of acyclic conjunctive
queries. 

Abiteboul and Duschka [AD98] show that recursion and negation in the view definition lead to 
undecidability. They show that the Closed World Assumption (CWA) complicates the problem. 
Under the Open World Assumption (OWA), the certain answers in the conjunctive view definitions/Datalog
queries case can be computed in polynomial time. On the other hand, the conjunctive
view definitions/conjunctive queries case is co-NP-complete under the CWA. They prove
that inequalities (a weak form of negation) lead to intractability. Even under the OWA, adding
inequalities to the queries, or disjunction to the view definitions, makes the problem co-NP-hard. In
his thesis, Duschka [Dus97b] provides a summary of the complexity results associated with these
problems.

Chekuri and Rajaraman [CR97] present polynomial-time algorithms to test the containment of an
arbitrary conjunctive query in an acyclic query and to minimize an acyclic query. They generalize
the query containment and minimization algorithms to arbitrary queries. They consider the problem
of finding an equivalent rewriting of a conjunctive query Q using a set of views V defined by
conjunctive queries, when Q does not have repeated predicates, and show how their algorithms for
query containment can be modified for this problem. A restricted variant of this problem, where
neither Q nor the views in V use repeated predicates, is known to be NP-complete [LMSS95].

Duschka and Genesereth [DG98] treat views that may be defined by disjunction. Their focus is on
maximal query containment. They show a duality between a query plan being maximally contained 
in a query and this plan computing exactly the certain answers. They show that the plan they 
generate is maximally contained in the query and that the disjunctive plan can be evaluated in 
co-NP time. The complexity results described above also apply to the problems that we discuss in 
this paper. 

Several systems and a number of algorithms have been implemented for the folding problem.
Levy, Rajaraman, and Ordille [LRO96b] developed the Information Manifold System at AT&T
Labs. The system incorporates the bucket algorithm, which controls search by first considering 
each subgoal in a query in isolation, and creating a bucket that contains only the views relevant to 
that subgoal. The algorithm then creates rewritings by combining one view from each bucket. 
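
The two phases of the bucket algorithm described above can be sketched as follows. This is only a rough illustration under strong simplifying assumptions: relevance is decided by predicate name alone, and the containment check that the actual algorithm applies to each combination is omitted. The view names `v1`, `v2`, `v3` and both function names are hypothetical.

```python
from itertools import product

def build_buckets(query_subgoals, views):
    """One bucket per query subgoal, holding the names of candidate views.

    Simplification (an assumption of this sketch): a view is treated as
    relevant when its body mentions the subgoal's predicate name. The
    published algorithm additionally checks unifiability and which
    variables the view exports.
    """
    buckets = []
    for subgoal in query_subgoals:
        bucket = [name for name, body in views.items()
                  if any(atom[0] == subgoal[0] for atom in body)]
        buckets.append(bucket)
    return buckets

def candidate_rewritings(query_subgoals, views):
    """Combine one view from each bucket; containment checks are omitted."""
    return list(product(*build_buckets(query_subgoals, views)))

# Atoms are tuples (predicate, arg1, ...); each view is given by its body.
views = {
    "v1": [("p1", "X", "Y")],
    "v2": [("p2", "X", "U")],
    "v3": [("p1", "X", "Y"), ("p2", "X", "U")],
}
query = [("p1", "A", "B"), ("p2", "A", "C")]

print(build_buckets(query, views))
print(candidate_rewritings(query, views))
```

Each tuple returned by `candidate_rewritings` is only a candidate; in the real algorithm it would still have to be tested for containment in the query.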

Qian, Duschka, and Genesereth [Qia96, DG97a, DG97b, Dus97b] are responsible for the Infomaster
System. They use the inverse-rules algorithm, and consider rewritings for each database
relation independent of any particular query. Given a user query, these rewritings are combined
appropriately. They show that rewritings produced by the inverse-rules algorithm need to be further
processed in order to be appropriate for query evaluation. They show this additional processing
step duplicates much of the work done in the second phase of the bucket algorithm. The bucket
algorithm is also shown to have several deficiencies and does not scale up. Details of these algorithms
are presented by Levy in [Lev00].

Pottinger and Levy [PL00] have developed a scalable algorithm for answering queries with views.
They describe and analyze the bucket and the inverse-rule algorithms. They then describe the 
MiniCon algorithm, for finding the maximally contained rewriting of a conjunctive query using 
conjunctive views. They provide the first experimental study of algorithms for answering queries. 
They show that the MiniCon algorithm both scales up and significantly outperforms the previous 
algorithms. They further develop an extension of the MiniCon algorithm to handle comparison 
predicates, and show its performance experimentally. 

Afrati, Li, and Ullman [ALU01] discuss generating efficient, equivalent rewritings using views
to compute the answer to a query. Each rewriting of a query is passed as a logical plan to an 
optimizer, which translates the rewriting to a physical plan. Each physical plan accesses the stored 
("materialized") views, and applies a sequence of relational operators to compute the answer to the 
original query. They consider three cost models for evaluating the efficacy of a plan. They develop 
and experimentally evaluate an efficient algorithm, CoreCover, based on a simple cost model $M_1$
that counts only the number of subgoals in a physical plan. 

3 Background 

This section contains a summary of the background and notation used in this paper. We use the 
language and terminology of logic databases (also known as deductive databases) ([Das92], [Llo87], 
and [LMR92]). Logic databases express data, rules (views), integrity constraints and queries in
first-order logic ([BJ89] and [Llo87]).¹



We use standard syntax for first-order logic, with the usual symbols for variables, connectives, 
quantifiers, punctuation, equality, constants, and predicates. The notions of term, formula, and 
sentence (a formula with no free variables) are defined in the usual way. We do not necessarily 
restrict formulas to be function-free. We shall make it clear when we utilize function symbols. Any 
formula may be considered as a query (although, below, we restrict the form of standard queries). 
A formula is called ground if it contains no variables. 

A substitution is a set of substitution pairs, for example, $\{X/a, Y/b\}$, such that every element of a
substitution pair is a variable or constant (or, more generally, a term), and such that the collection
of left-hand sides of the substitution pairs ($X$ and $Y$ in our example) are unique variables. A
substitution applied to a formula is a rewrite of the formula by replacing any occurrence in the
formula of a left-hand element from the substitution by its right-hand counterpart, in parallel. Let
the formula $\mathcal{F}$ be $p(X,Y)$ and the substitution $\theta$ be again $\{X/a, Y/b\}$. The substitution $\theta$ applied
to formula $\mathcal{F}$, written as $\mathcal{F}\theta$, is the formula $p(a,b)$.
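
As a small illustration of applying a substitution in parallel, the sketch below represents an atom as a tuple and a substitution as a dictionary. The representation and the function name `apply_substitution` are assumptions made for this example only.

```python
def apply_substitution(atom, theta):
    """Replace every argument of `atom` that is bound in `theta`.

    All replacements are made in parallel: each argument is looked up
    once in `theta`, so the result of one replacement is never rewritten
    by another pair of the same substitution.
    """
    predicate, *args = atom
    return (predicate, *[theta.get(arg, arg) for arg in args])

formula = ("p", "X", "Y")        # the atom p(X, Y)
theta = {"X": "a", "Y": "b"}     # the substitution {X/a, Y/b}
print(apply_substitution(formula, theta))
```

The parallel reading matters for substitutions such as {X/Y, Y/X}, which swap the two variables rather than collapsing them.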

We assume that the reader is familiar with the unification algorithm [Rob65]. The unification 
algorithm takes a set of relations with the same relation name and attempts to find a substitution 
for the variables that will make the relations all identical. 
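
For readers who want a concrete reference point, the following is a minimal sketch of unification for two atoms, not the formulation of [Rob65]. It assumes that variables are strings beginning with an uppercase letter, that constants are other strings, and that function terms (such as Skolem terms) are tuples; the bindings are returned in triangular form, so a bound variable may map to another variable that is itself bound.

```python
def is_var(term):
    """Variables are strings starting with an uppercase letter (assumption)."""
    return isinstance(term, str) and term[:1].isupper()

def walk(term, theta):
    """Follow variable bindings until an unbound variable or non-variable."""
    while is_var(term) and term in theta:
        term = theta[term]
    return term

def occurs(var, term, theta):
    """Occurs check: does `var` appear inside `term` under `theta`?"""
    term = walk(term, theta)
    if term == var:
        return True
    return isinstance(term, tuple) and any(occurs(var, a, theta) for a in term[1:])

def unify(s, t, theta=None):
    """Return a unifier of s and t as a dict, or None if none exists."""
    theta = {} if theta is None else dict(theta)
    stack = [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = walk(a, theta), walk(b, theta)
        if a == b:
            continue
        if is_var(a):
            if occurs(a, b, theta):
                return None
            theta[a] = b
        elif is_var(b):
            if occurs(b, a, theta):
                return None
            theta[b] = a
        elif (isinstance(a, tuple) and isinstance(b, tuple)
              and a[0] == b[0] and len(a) == len(b)):
            stack.extend(zip(a[1:], b[1:]))
        else:
            return None
    return theta

print(unify(("p", "X", "b"), ("p", "a", "Y")))
print(unify(("p", "a"), ("p", "b")))
```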

An important class of sentence is the clause. A clause has the general form: 

$\forall.\ A_1 \vee \ldots \vee A_k \vee \neg A_{k+1} \vee \ldots \vee \neg A_n$ (1)

in which each $A_i$ is an atomic formula, for $i \in \{1, \ldots, n\}$, and in which the variables are understood
to be universally quantified (denoted by '$\forall$'). Any clause can be written in a logically equivalent
form as an implication.

$\forall.\ A_1 \vee \ldots \vee A_k \leftarrow A_{k+1} \wedge \ldots \wedge A_n.$ (2)

This is often written in further shorthand as

$A_1, \ldots, A_k \leftarrow A_{k+1}, \ldots, A_n.$ (3)

in which disjunction is assumed on the left-hand side of the implication arrow, conjunction on the
right-hand side, and the universal quantification is understood. A clause in this form is also called
a rule. The collection of atoms on the left-hand side ($A_1, \ldots, A_k$) is called the head of the rule, and
the collection of atoms on the right-hand side ($A_{k+1}, \ldots, A_n$) the body. When $k = n$, the body is
empty, and when $k = 0$, the head is empty. A Horn rule has at most one atom in the head: $k \le 1$.
A definite clause has exactly one atom in the head: k = 1. A ground rule contains no variables. In 
a rule a variable is called limited if it appears in the body of an ordinary (non built-in) atom or in 
an equality with a constant or a variable that is limited. A rule is called safe if all its variables are 
limited. 

A rule with an empty head is generally considered to be a query or an integrity constraint. An 
answer to a query is a ground substitution of the query formula such that the resulting ground 
formula is true with respect to the database; that is, the grounded query formula is logically entailed 
by the database. A ground rule with an empty body is called a fact. Definite rules have a clear 
procedural interpretation. Consider 

$A \leftarrow B_1, \ldots, B_n.$ (4)

¹Some proposals for integrity constraints express them in a higher-order logic, while keeping the database (the
facts, rules, and, sometimes, queries) in first-order logic.

We call this clause a rule for $A$. The above rule can be interpreted to say that $A$ is shown (or proven)
whenever all the $B_i$'s are shown (proven).² Rules are essentially views, in the parlance of relational
databases. Logically, rules are more expressive than views in relational databases because recursion
is permitted.³ A fact then, having no conditions in the body of its "rule", is simply interpreted as
true. We define an expanded rule ([Cha85, CGM88]) to be one in which all predicates have been
expanded. An expanded predicate is one in which all constants and repeated variables have been
replaced by unique new variables, and the appropriate equalities have been added to the body of
the rule in which the predicate appears ([CGM90]). This is related to the term rectified set of rules,
where the head of each rule in the set is identical, with each argument a distinct variable ([Ull89]).

A database may then be defined as a collection of rules and facts. When all the rules and facts are 
definite (that is, the rules and facts have at most a single atom in the heads of the clauses that 
define them), the database is called definite. It is called disjunctive (or indefinite) otherwise. We
call the language in which the database is written with definite clauses as defined above Datalog
[Ull89]. Recall that terms in clauses are function-free, as noted above, hence all Datalog terms
are function-free. When disjunctive clauses are permitted for rules or facts (and whose terms are
function-free), we call the language Disjunctive Datalog [EGM97, LMR92]. A database DB often
is defined as consisting of two parts: 

• the extensional database, EDB, and

• the intensional database, IDB.

The EDB is the database's collection of facts. The IDB is the database's collection of rules.
(We soon redefine databases to have two additional components, the set of the database's integrity
constraints (ICs) and the set of resource rules (ResDB).)

Conventionally, negative data is not represented explicitly in a logic database. There are several 
standard approaches to allow negative data to be inferred. The closed world assumption (CWA) is 
a default rule for the inference of negative facts [Rei78b]. For any ground atom A, the negation of
A is accepted as true if A is not provable from the database. The set of all negated atoms inferable
in this way is written as CWA(DB). Another approach to negation is the Clark completion of a
database [Cla78]. This formalizes the concept that the set of tuples true for a predicate is precisely
the set that can be proven to be true via the facts and rules. In brief, this is accomplished by
adding a formula to the database for each predicate (to correspond with the collection of rules for
that predicate), to supply the logical only if half of the definition of the predicate. Certain negated
facts are then deducible from the completed database, the database with these "only if" formulas
added. We refer to [Cla78] for the precise definition. In our application we will typically have a
situation where a resource predicate is defined by some extensional predicates, such as

$r(X,Y) \leftarrow h(X,Z), k(Z,Y)$ (5)

where $r$ is the resource predicate and $h$ and $k$ are extensional predicates, saying essentially that if
$\langle a, b \rangle$ is in the join of $h$ and $k$, it is in $r$. The Clark completion changes the implication to an
equivalence to say that $\langle a, b \rangle$ is in the join of $h$ and $k$ if and only if $\langle a, b \rangle$ is in $r$. When we
compute the Clark completion of a rule such as in clause 5 we obtain two rules, one for $h$ and one for
$k$. The variable $Z$ in clause 5 represents an existentially quantified variable whose value depends
on the variables $X$ and $Y$. When we obtain the only-if part of the Clark completion, namely, the
rules with the implication arrow reversed, the predicates with the variable $Z$ appear in the heads
of the clauses and $Z$ is replaced by the Skolem function $f(X,Y)$. Thus, the only-if portion of the
Clark completion becomes:

$h(X, f(X,Y)) \leftarrow r(X,Y)$ (6)

$k(f(X,Y), Y) \leftarrow r(X,Y)$ (7)

²The interpretation for disjunctive rules is less apparent. Essentially, a disjunctive rule states that at least one of
the atoms in the rule's head is proven whenever all the atoms in the body are.

³The SQL-3 standard extends SQL to support recursion [MS93], however. So once SQL-3 becomes the standard,
this difference in expressiveness will go away, since any relational database that supports SQL-3 will, in fact, be a
deductive database system.

In the text, when there is no loss of information, we omit the variable portion of the Skolem 
functions. For example, we replace $f(X,Y)$ by $f$.
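
The construction of the only-if rules can be sketched mechanically. The code below is an illustrative sketch, not the preprocessing algorithm of this paper: it assumes atoms are tuples, variables are strings beginning with an uppercase letter, and each existential variable V is replaced by a Skolem term named `f_V` over the head variables (the naming scheme and the function `invert_resource_rule` are hypothetical).

```python
def is_var(term):
    """Variables are strings starting with an uppercase letter (assumption)."""
    return isinstance(term, str) and term[:1].isupper()

def invert_resource_rule(head, body):
    """Return the only-if rules (new_head, new_body_atom) for head <- body.

    Every body atom becomes the head of one reversed rule whose body is the
    resource atom. Variables that occur only in the body are existential and
    are replaced by Skolem terms over the variables of the resource atom.
    """
    head_vars = [a for a in head[1:] if is_var(a)]
    skolem = {}
    for atom in body:
        for arg in atom[1:]:
            if is_var(arg) and arg not in head_vars and arg not in skolem:
                skolem[arg] = ("f_" + arg, *head_vars)
    return [((atom[0], *[skolem.get(a, a) for a in atom[1:]]), head)
            for atom in body]

# Clause (5): r(X,Y) <- h(X,Z), k(Z,Y)
for new_head, resource_atom in invert_resource_rule(
        ("r", "X", "Y"), [("h", "X", "Z"), ("k", "Z", "Y")]):
    print(new_head, "<-", resource_atom)
```

For clause 5 this yields one reversed rule for `h` and one for `k`, both sharing the same Skolem term for `Z`, which mirrors clauses 6 and 7.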

In our formulas we allow built-in predicates, such as =,<,>. When built-in predicates occur in 
formulas the appropriate axioms need to be added to the theory. The axioms for equality, for 
example, are given below. 

Equality Axioms: 

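$\forall X\, (X = X)$

$\forall X \forall Y\, ((X = Y) \rightarrow (Y = X))$

$\forall X \forall Y \forall Z\, ((X = Y) \wedge (Y = Z) \rightarrow (X = Z))$

$\forall X_1 \cdots \forall X_n\, (P(X_1, \ldots, X_n) \wedge (X_1 = Y_1) \wedge \cdots \wedge (X_n = Y_n) \rightarrow P(Y_1, \ldots, Y_n))$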

We will discuss the use of equality axioms when they are needed in proofs. 

So far, we have assumed that the body of a database rule (clause) contains only positive atoms. 
However, it is useful sometimes to define database rules that allow negated atoms in the body of 
a rule. We need default negation in logic databases if we want to subsume the relational algebra, 
which includes set difference. We can extend deductive databases with default negation. A rule 
which has a negated atom, i.e. an atom preceded by not in its body, is called a normal rule, and
deductive databases that have normal rules are called normal databases. We call Datalog that has
been extended with default negation Datalog$^{\neg}$. For example, the normal rule

$p(X) \leftarrow \mathit{not}\ q(X).$ (8)

is interpreted, in general, to mean that, for any constant $a$, if $q(a)$ is not true (or cannot be proven
to be true), then $p(a)$ is true.

We write this negation with not rather than with the symbol for logical negation, and refer to 
it as default negation. This is because most semantics that have been defined for normal databases, 
interpret the use of default negation differently from one another and from logical negation. There 
are a number of semantics that have been defined for normal databases, and no one semantics is 
universally accepted. Also, since the notion of default negation is generally based on provability, 
not logical truth, such default negation is beyond first-order logic.⁴

Thus it is not equivalent to exchange the rules in the IDB that use default negations with seemingly 
equivalent disjunctive rules, in which the negated atoms in the body have been moved to the head. 
Consider the following example. 

$DB_1$ contains two clauses: (1) $p \leftarrow q, \mathit{not}\ r$, and (2) $q$.

$DB_2$ contains two clauses: (1) $p \vee r \leftarrow q$, and (2) $q$.

⁴We still speak in terms of first-order logic even for normal databases, as most of the first-order framework of
deductive databases remains applicable.

That is, let $DB_1$ consist of the single rule for the predicate $p$ and the single fact for the predicate
$q$, and $DB_2$ consist of the single disjunctive rule and the single fact for the predicate $q$. If the
rule in $DB_1$ were written with logical negation instead of default negation, $DB_1$ and $DB_2$ would
be logically equivalent. However, in $DB_1$, we should be able to infer $p$, because the fact $q$ can
be inferred (it is a fact), and the fact $r$ cannot be inferred (thus, not $r$ can be assumed true by
default). In $DB_2$, $p$ cannot be inferred. Only the weaker, disjunctive fact $p \vee r$ can be inferred.
Note that default negation results in non-monotonicity. If we were to add the fact $r$ to $DB_1$ above,
we would no longer be able to infer $p$.
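
The non-monotonic behaviour can be reproduced with a few lines of code. The sketch below is a deliberately restricted evaluator for propositional normal rules; it assumes that negated atoms never occur in a rule head, so that "not a" can be decided from the facts alone. It reads the rule of the first database as deriving p when q holds and r is not provable, as in the discussion above. The function name `infer` is hypothetical.

```python
def infer(facts, rules):
    """Naive fixpoint for propositional rules (head, positives, negatives).

    Restriction of this sketch: negated atoms must not be defined by any
    rule, which makes the program trivially stratified.
    """
    heads = {head for head, _, _ in rules}
    assert all(n not in heads for _, _, negs in rules for n in negs)
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, positives, negatives in rules:
            if (head not in known
                    and all(a in known for a in positives)
                    and all(n not in known for n in negatives)):
                known.add(head)
                changed = True
    return known

rules = [("p", ["q"], ["r"])]            # p <- q, not r
print("p" in infer({"q"}, rules))        # first database: p is inferred
print("p" in infer({"q", "r"}, rules))   # after adding the fact r: p is lost
```

Adding a fact removes a conclusion, which cannot happen under classical first-order entailment.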

The intuition behind the use of default negation becomes confused when it is combined with recur- 
sion. One solution to this confusion is simply not to allow recursive definitions through negation. 
A canonical example of recursion through negation is

$p(X) \leftarrow \mathit{not}\ q(X).$ (9)
$q(X) \leftarrow \mathit{not}\ p(X).$ (10)

The restriction not to allow recursion through negation leads to what are called stratified databases,
and such databases have a unique standard model called the perfect model of the database. (Stratified
databases are defined in [ABW88, LMR92].) In some cases, a non-stratified database may
also have a unique standard model. Some of these cases may be captured by the concept of stable
database. Two important model semantics for normal databases, and normal logic programs, are
the well-founded semantics [GRS91] and the stable model semantics or the semantically equivalent
well-supported model semantics [Fag91, GL88]. [Min96] provides a retrospective on work in
semantics for logic programs and deductive databases.

4 Query Folding

This section contains the basic material on query folding. We assume that the resource rules
define the predicates of the data sources that are conveniently available while the EDB and IDB
predicates may take longer to use or may be unavailable. Consequently, a query is optimized in
the sense that it has been rewritten using the data sources, which presumably are readily available.
The folded query can then be optimized by other well-known techniques. In this section we provide
an algorithm for such query rewriting in a special case. In later sections we extend this algorithm
to more complex databases. We start by giving the restrictions on the type of database we consider
in Sections 4 and 5. Our query folding algorithm is illustrated on an example before it is described.
We end this section by giving several additional examples.

4.1 Database Restrictions

From now on a database will consist of four parts: the EDB, IDB, IC, and ResDB. We assume
that the resource rules define the predicates of the data sources that are conveniently available while
the EDB and IDB predicates may take longer to use. In this section we provide an algorithm
for such query rewriting in a special case. In later sections we will extend this algorithm to more
complex databases.

We place the following conditions on the database in this and the following section.

1. No formula contains negation. 

2. Each IDB predicate may be defined by multiple safe, conjunctive, non-recursive function-free 
Horn rules. 

3. Each distinct ResDB predicate is defined by a single safe conjunctive function-free Horn
formula on EDB and/or IDB predicates.

4. Each IC clause is a safe function-free Horn formula of the form $G \leftarrow F$ where $G$ is either
empty or has one EDB predicate and $F$ is a conjunction of EDB predicates.

5. Each query has the form $q \leftarrow G$ where $G$ is a conjunction of EDB and IDB predicates.

6. The database includes axioms for built-in predicates as needed. 

When the IDB is non-recursive, it has been shown [Rei78a] that the rules can be compiled so that 
every IDB predicate can be written as a set of rules, each rule in terms only of EDB predicates. 
We assume in this section that the compiled rules replace the original rules. Hence, deduction using 
IDB predicates is effectively one-step. 

In the following we consider the concept of bounded recursion. Minker and Nicolas [MN82] were the
first to show that there are forms of rules that lead to bounded recursion. That is, the deduction 
process using these rules must terminate in a finite number of steps. This work has been extended 
by Naughton and Sagiv [NS87]. We illustrate here one special case of bounded recursion, namely, 
singular rules. 

A recursive rule of the form

$R \leftarrow F \wedge R_1 \wedge \ldots \wedge R_n$,

where $F$ is a conjunction of possibly empty base (i.e. EDB) relations and $R, R_1, R_2, \ldots, R_n$ are
atoms that have the same relation name, is singular iff:

1. each variable that occurs in an atom Ri and does not occur in R only occurs in Ri; 

2. each variable in R occurs in the same argument position in any atom Ri where it appears, 
except perhaps in at most one atom Ri that contains all of the variables of R. 

Thus, the rule 

$R(X, Y, Z) \leftarrow R(X, Y', Z), R(X, Y, Z')$

representing the multivalued dependency $R: X \twoheadrightarrow Y$, is singular since (a) $Y'$ and $Z'$ appear
respectively only in the first and second atoms in the body of the rule (condition 1), and (b) the variables
$X$, $Y$, $Z$ always appear in the same argument position (condition 2). It is known that all singular
rules have bounded recursion.

We now specify three cases for the IC since each case has to be handled in a different manner in
the folding algorithm:

Cases of Integrity Constraints (ICs)

1. Case 1: The ICs have no recursion and no built-in predicate in the head of a clause. 

2. Case 2: The ICs have bounded recursion and no built-in predicate in the head of a clause. 
For example this is the case if there is a multivalued dependency. 

3. Case 3: Either the ICs are recursive, or there is a built-in predicate in the head of a clause. 
For example, for a functional dependency, an induced recursion may arise since the equality 
axioms are recursive. 



4.2 Illustrative Example 

We start by illustrating our algorithm on a simple example. This example has a simple integrity 
constraint. We deal with functional and inclusion dependencies in the next section. We write the 
formal description afterwards and show how it subsumes other algorithms used for this type of 
query rewriting. 

Example 1. EDB: $p_1(X,Y,Z)$, $p_2(X,U)$, $p_3(X,Y)$
IDB:

IC: $p_3(X,Y) \leftarrow p_1(X,Y,Z), Z > 0$

ResDB: $r(X,Y,Z) \leftarrow p_1(X,Y,Z), p_2(X,U)$

Query: $q(X,Y) \leftarrow p_1(X,Y,Z), p_2(X,U), p_3(X,Y), Z > 1$

The first step involves reversing the resource rules to define the EDB predicates in terms 
of the resource predicates. This process supplies the only if half of the definition of the re- 
source predicates; so we call these rules the Clark Completion resource rules. In this example 
we obtain for the Clark Completion resource rules: 

CCrr1: $p_1(X, Y, Z) \leftarrow r(X, Y, Z)$
CCrr2: $p_2(X, f(X, Y, Z)) \leftarrow r(X, Y, Z)$



Figure 1 shows the derivation starting with the query as the top clause of a linear resolution
tree. Both the IC and the Clark Completion resource rules are used. The clause $Z > 0$ is
subsumed by the clause $Z > 1$. The final rewritten query at the bottom of the tree,
$\leftarrow r(X,Y,Z), Z > 1$,

contains only the r predicate and the evaluable predicate >. Using this query we obtain 
correct answers to the original query. 
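
The claim can be checked on a small concrete instance. The data below are invented for illustration; the relation $p_3$ is constructed so that the integrity constraint of the example holds (every $p_1$ tuple with $Z > 0$ has its $(X,Y)$ pair in $p_3$), and the resource $r$ is materialized from its defining rule.

```python
# Hypothetical EDB instance for Example 1 (values are made up).
p1 = {(1, "a", 2), (1, "b", 0), (2, "c", 5), (3, "d", 1)}
p2 = {(1, "u"), (2, "v"), (2, "w")}
# p3 is built to satisfy the IC p3(X,Y) <- p1(X,Y,Z), Z > 0,
# plus one unrelated tuple.
p3 = {(x, y) for (x, y, z) in p1 if z > 0} | {(9, "z")}

# Resource rule: r(X,Y,Z) <- p1(X,Y,Z), p2(X,U)
r = {(x, y, z) for (x, y, z) in p1 for (x2, _u) in p2 if x == x2}

def original_query():
    """q(X,Y) <- p1(X,Y,Z), p2(X,U), p3(X,Y), Z > 1 over the EDB."""
    return {(x, y) for (x, y, z) in p1 for (x2, _u) in p2
            if x == x2 and (x, y) in p3 and z > 1}

def folded_query():
    """q(X,Y) <- r(X,Y,Z), Z > 1 over the resource only."""
    return {(x, y) for (x, y, z) in r if z > 1}

print(sorted(original_query()))
print(sorted(folded_query()))
```

On this instance both queries return the same pairs. A single instance is of course no proof; it only illustrates what the resolution derivation establishes in general.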

This example uses an integrity constraint to obtain a clause that contains only a resource 
predicate and an evaluable predicate. When the theory is Horn. Rcitcr has shown [Rci78a] 
that the use of integrity constraints is not necessary to obtain answers. In this case, if the 
integrity constraint were not used, the clause at the end of the proof tree would have been a 
partial folding consisting of a resource predicate, an EDB predicate and an evaluable predi- 
cate. Using the integrity constraint eliminates the EDB predicate and provides an optimiza- 
tion step, as in the case of semantic query optimization [CGM90]. We can illustrate another 
aspect of semantic query optimization by changing the integrity constraint in this example to: 
IC: $\leftarrow p_2(X,U), p_3(X,Y)$

This integrity constraint subsumes the query; hence the query has no answers and there
is no need to try to fold the query.

Query: $\leftarrow p_1(X,Y,Z), p_2(X,U), p_3(X,Y), Z > 1$
  IC: $p_3(X,Y) \leftarrow p_1(X,Y,Z), Z > 0$
$\leftarrow p_1(X,Y,Z), p_2(X,U), p_1(X,Y,Z), Z > 0, Z > 1$
  factor and subsume
$\leftarrow p_1(X,Y,Z), p_2(X,U), Z > 1$
  CCrr1: $p_1(X,Y,Z) \leftarrow r(X,Y,Z)$
$\leftarrow r(X,Y,Z), p_2(X,U), Z > 1$
  CCrr2: $p_2(X, f(X,Y,Z)) \leftarrow r(X,Y,Z)$, $\{U/f(X,Y,Z)\}$
$\leftarrow r(X,Y,Z), r(X,Y,Z), Z > 1$
  factor
$\leftarrow r(X,Y,Z), Z > 1$

Figure 1: Literal Elimination Example

4.3 Query Folding Algorithm 

At this point we describe the first version of the query folding algorithm, where the database satisfies 
the six conditions given at the beginning of Section 4.1. In particular each resource predicate is 
defined by a single safe conjunctive formula on EDB. 

As mentioned in Example 1 the algorithm uses the Clark Completion resource rules for the resource 
predicates. We obtain these rules by a preprocessing algorithm that needs to be done only once for 
a database. 

Preprocessing Algorithm (Clark Completion) 

Input: ResDB. We may assume that each resource predicate $r$ is written in the form

$r(X) \leftarrow p_1(X_1), \ldots, p_n(X_n)$,

where $X$ contains variables, each $X_i$ ($1 \le i \le n$) consists of terms (constants or variables)
and $X \subseteq \bigcup_{i=1}^{n} X_i$.

Output: The Clark Completion resource rules CCrr in clausal form.
begin

Step 1. Apply the Clark Completion to each resource predicate definition to write it as
$r(X) \leftrightarrow \exists Z\,(p_1(X_1), \ldots, p_n(X_n))$
where $Z$ is the set of variables in $(\bigcup_{i=1}^{n} X_i) - X$.

Step 2. Rewrite the equivalences obtained in Step 1 into rules in clausal form, called the Clark
Completion resource rules (CCrr), as
$r(X) \leftarrow p_1(X_1), \ldots, p_n(X_n)$
$p_1(X_1') \leftarrow r(X)$
$\vdots$
$p_n(X_n') \leftarrow r(X)$

where $X_i'$ ($1 \le i \le n$) is obtained from $X_i$ by replacing every variable $X_j \in Z$ by $f_{r,j}(X)$. (In
our examples we will usually use variables such as $X$, $Y$, and $Z$, and function symbols $f$, $g$,
and $h$, and omit subscripts.)

end 
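
The preprocessing step can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: atoms are encoded as (predicate, arguments) tuples, variables are strings that start with an uppercase letter, and a Skolem term $f_{r,j}(X)$ is encoded as a nested tuple. The function name `clark_completion_rules` and the whole encoding are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def clark_completion_rules(head, body):
    """Return the 'only if' rules p_i(X_i') <- r(X) for one resource rule.

    head: (resource_name, tuple_of_head_variables)
    body: list of (predicate_name, tuple_of_terms)
    Each body variable that is not a head variable is replaced by a
    Skolem term (function_name, head_variables).
    """
    name, head_vars = head
    existential = []
    for _pred, args in body:
        for term in args:
            if is_var(term) and term not in head_vars and term not in existential:
                existential.append(term)
    # One Skolem function per existential variable, applied to the head variables.
    skolem = {v: (f"f_{name}_{j}", tuple(head_vars))
              for j, v in enumerate(existential, start=1)}
    rules = []
    for pred, args in body:
        new_args = tuple(skolem.get(term, term) for term in args)
        rules.append(((pred, new_args), head))
    return rules


# Resource rule of Example 1: r(X,Y,Z) <- p1(X,Y,Z), p2(X,U)
rules = clark_completion_rules(
    ("r", ("X", "Y", "Z")),
    [("p1", ("X", "Y", "Z")), ("p2", ("X", "U"))],
)
```

On the resource rule of Example 1 the sketch produces one rule per body atom, with the existential variable $U$ replaced by a Skolem term over $X, Y, Z$.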

We now describe the simplest form of the folding algorithm, the one with the database restrictions
of Section 4.1.

The algorithm described below uses linear derivation ([CL73]), which includes a backtracking
mechanism. Backtracking occurs when we find a linear derivation that has no resource predicates.
Before the algorithm commences we assume that there is a test to determine which of the three
cases applies to ICs. In Case 1, nothing has to be done. In Case 2, we assume that the linear
derivation is modified to include a check to determine if the clause, $L$, that has been generated,
satisfies the bounding condition, and if it does, then backtracking occurs. In Case 3, a depth bound
$k$ is specified and if the depth is reached, backtracking occurs. We also assume that if there are
built-in predicates such as $=$, $\neq$, $>$, then the input clauses are placed in expanded form. If there
are no built-in predicates, it is unnecessary to do the expansion.

Folding Algorithm 1 (Finding a Single Folding) 

Input: $C$: the set of clauses in the EDB, IDB, IC, and CCrr, and

the query, $q(X) :\leftarrow G(Y)$, where $X \subseteq Y$, and $G$ is a conjunction of atoms. We call $X$ the
query variables.

Output: A query $fq(X) :\leftarrow L(Z)$, where $X \subseteq Z$.
begin

Starting with $\leftarrow G$, find a linear derivation using $C$ that results in a clause $\leftarrow L$ that contains
the query variables and no function symbols. When $L$ contains at least one resource predicate
and no EDB predicates, it constitutes a complete folding; otherwise $L$ may contain
some EDB predicates and hence it constitutes a partial folding.
end

In our figures, we show illustrative derivations, but not the detailed steps leading to that derivation
that may have arisen by backtracking. Starting with $\leftarrow G$, we essentially find a linear derivation
using $C$ that results in a clause $\leftarrow L$ that contains the query variables. If $L$ contains at least one
resource predicate and no EDB predicates, then it constitutes a complete folding; otherwise $L$
contains some EDB predicates and constitutes a partial folding. Note that when the algorithm
terminates with an answer, backtracking may find additional answers.

At any point if an integrity constraint subsumes a clause in the linear derivation, the process backs 
up because the query cannot have any answers along that path. We have omitted this step since 
subsumption is time consuming. An algorithm for subsumption may be found in [CL73] . If a clause 
in a derivation contains a constant, the expansion of a clause may allow us to derive a solution. 
This modification is omitted from the algorithm. 

Next we show that the Folding Algorithm is sound or correct. By this we mean that every tuple
obtained by solving the query $fq(X)$ is also obtained by solving the query $q(X)$. That is, every
answer to a folded query is an answer to the original query.

Theorem 4.1. The Folding Algorithm is sound (correct).

Proof. Using the notation $q(X) \leftarrow G(Y)$ and $fq(X) \leftarrow L(Z)$, by the soundness of resolution, we
obtain

$(\leftarrow G(Y)) \cup C \models (\leftarrow L(Z))$,

so, by logical equivalence, we have

$L(Z) \cup C \models G(Y)$.

Suppose that $a$ is a solution to $fq(X)$ in $DB$. This means that there is a $b$, with $b[X] = a$,
such that $DB \models L(b)$. Also, for every formula $c \in C$, $DB \models c$. Therefore, there is a $d$,
where $d[X] = a$, such that $DB \models G(d)$. But this means that $a$ is a solution to $q(X)$ in $DB$.
□
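
Soundness can be checked concretely on Example 1. The following Python sketch evaluates the original query and the folded query over a small EDB instance; the instance is made up for illustration (it is not from the paper) and is chosen to satisfy the integrity constraint $p_3(X,Y) \leftarrow p_1(X,Y,Z), Z > 0$. All function names are hypothetical.

```python
# Made-up EDB instance for Example 1; it satisfies the IC
# p3(X,Y) <- p1(X,Y,Z), Z > 0.
p1 = {(1, 2, 5), (1, 3, 0), (2, 2, 3)}
p2 = {(1, 9), (2, 8)}
p3 = {(1, 2), (2, 2), (7, 7)}


def resource_r(p1, p2):
    """Materialize the resource r(X,Y,Z) <- p1(X,Y,Z), p2(X,U)."""
    xs_in_p2 = {x for (x, _u) in p2}
    return {(x, y, z) for (x, y, z) in p1 if x in xs_in_p2}


def original_query(p1, p2, p3):
    """q(X,Y) :<- p1(X,Y,Z), p2(X,U), p3(X,Y), Z > 1."""
    xs_in_p2 = {x for (x, _u) in p2}
    return {(x, y) for (x, y, z) in p1
            if x in xs_in_p2 and (x, y) in p3 and z > 1}


def folded_query(r):
    """Folded query <- r(X,Y,Z), Z > 1, evaluated on the resource only."""
    return {(x, y) for (x, y, z) in r if z > 1}


r = resource_r(p1, p2)
answers_q = original_query(p1, p2, p3)
answers_fq = folded_query(r)
```

On this instance every answer of the folded query is also an answer of the original query, which is what Theorem 4.1 requires.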
4.4 Additional Examples



Next we apply our algorithm to various examples considered by researchers and show that our 
algorithm can be used to obtain the same results. We start by taking two examples from [Qia96]. 
In that paper the EDB consists of six relations that represent a patient record database: 

Example 2. (Examples 2 and 3, Qian [Qia96])

Patients(patient_id, clinic, dob, insurance)

Physician(physician_id, clinic, pager_no)

Drugs(drug_name, generic?)

Notes(note_id, patient_id, physician_id, note_text)

Allergy(note_id, drug_name, allergy_text)

Prescription(note_id, drug_name, prescription_text)

The IDB and IC are empty; hence Case 1 of the IC Cases (see page 10) applies. The ResDB
consists of two relations, Drug_Allergy and Prescribed_Drug. For convenience we write
Drug_Allergy as $r_1$ and Prescribed_Drug as $r_2$. In Datalog they are expressed as:

$r_1(X_1,X_2,X_3) \leftarrow \mathit{notes}(U_1,X_1,U_2,U_3), \mathit{allergy}(U_1,X_2,X_3)$

$r_2(Y_1,Y_2,Y_3,Y_4) \leftarrow \mathit{notes}(V_1,Y_1,Y_2,V_2), \mathit{prescription}(V_1,Y_3,V_3), \mathit{drugs}(Y_3,Y_4)$.

Preprocessing yields two Clark completion resource rules for $r_1$ and three Clark completion
resource rules for $r_2$. In the following formulae the functions $f_i$, $g_j$ are abbreviations for the
Skolem functions $f_i(X_1,X_2,X_3)$ and $g_j(Y_1,Y_2,Y_3,Y_4)$, respectively.

CCrr1: $\mathit{notes}(f_1,X_1,f_2,f_3) \leftarrow r_1(X_1,X_2,X_3)$ (11)

CCrr2: $\mathit{allergy}(f_1,X_2,X_3) \leftarrow r_1(X_1,X_2,X_3)$ (12)

CCrr3: $\mathit{notes}(g_1,Y_1,Y_2,g_2) \leftarrow r_2(Y_1,Y_2,Y_3,Y_4)$ (13)

CCrr4: $\mathit{prescription}(g_1,Y_3,g_3) \leftarrow r_2(Y_1,Y_2,Y_3,Y_4)$ (14)

CCrr5: $\mathit{drugs}(Y_3,Y_4) \leftarrow r_2(Y_1,Y_2,Y_3,Y_4)$ (15)



Let's consider first the query of Examples 2 and 3 of [Qia96]:

$q(X,Y) :\leftarrow \mathit{notes}(W_1,X,W_2,W_3), \mathit{allergy}(W_1,Y,W_4), \mathit{notes}(W_5,X,W_6,W_7), \mathit{prescription}(W_5,Y,W_8)$

Again, as in our first example, we start with the body of the query to find a derivation:

$\leftarrow \mathit{notes}(W_1,X,W_2,W_3), \mathit{allergy}(W_1,Y,W_4), \mathit{notes}(W_5,X,W_6,W_7), \mathit{prescription}(W_5,Y,W_8)$

The derivation is shown in Figure 2. Four of the Clark Completion resource rules are used.
The rewritten query at the bottom of the tree,

$\leftarrow r_1(X,Y,X_3), r_2(X,Y_2,Y,Y_4)$

consists of only resource predicates.

Query: $\leftarrow \mathit{notes}(W_1,X,W_2,W_3), \mathit{allergy}(W_1,Y,W_4), \mathit{notes}(W_5,X,W_6,W_7), \mathit{prescription}(W_5,Y,W_8)$
  CCrr1: $\mathit{notes}(f_1,X_1,f_2,f_3) \leftarrow r_1(X_1,X_2,X_3)$
$\leftarrow r_1(X,X_2,X_3), \mathit{allergy}(f_1,Y,W_4), \mathit{notes}(W_5,X,W_6,W_7), \mathit{prescription}(W_5,Y,W_8)$
  CCrr2: $\mathit{allergy}(f_1,X_2,X_3) \leftarrow r_1(X_1,X_2,X_3)$, $\{X_2/Y, X_3/W_4\}$
$\leftarrow r_1(X,Y,W_4), r_1(X_1,Y,W_4), \mathit{notes}(W_5,X,W_6,W_7), \mathit{prescription}(W_5,Y,W_8)$
  factor
$\leftarrow r_1(X,Y,W_4), \mathit{notes}(W_5,X,W_6,W_7), \mathit{prescription}(W_5,Y,W_8)$
  CCrr3: $\mathit{notes}(g_1,Y_1,Y_2,g_2) \leftarrow r_2(Y_1,Y_2,Y_3,Y_4)$, $\{W_5/g_1, Y_1/X, Y_2/W_6, W_7/g_2\}$
$\leftarrow r_1(X,Y,W_4), r_2(X,W_6,Y_3,Y_4), \mathit{prescription}(g_1,Y,W_8)$
  CCrr4: $\mathit{prescription}(g_1,Y_3,g_3) \leftarrow r_2(Y_1,Y_2,Y_3,Y_4)$
$\leftarrow r_1(X,Y,W_4), r_2(X,W_6,Y_3,Y_4), r_2(Y_1,W_6,Y,Y_4)$
  factor
$\leftarrow r_1(X,Y,W_4), r_2(X,W_6,Y,Y_4)$

Figure 2: Example 2 from [Qia96]

Now we consider Example 6 of [Qia96].
Example 3. (Example 6 of Qian [Qia96]) The EDB, IDB and IC are the same as before. ResDB
consists of one relation $r$ defined as follows:

$r(X_1,X_2,X_3) \leftarrow \mathit{notes}(U_1,X_1,X_2,U_2), \mathit{prescription}(U_1,X_3,U_3), \mathit{drugs}(X_3,U_4)$

Preprocessing yields three Clark completion resource rules for $r$ as follows:

CCrr1: $\mathit{notes}(f_1,X_1,X_2,f_2) \leftarrow r(X_1,X_2,X_3)$ (16)
CCrr2: $\mathit{prescription}(f_1,X_3,f_3) \leftarrow r(X_1,X_2,X_3)$ (17)
CCrr3: $\mathit{drugs}(X_3,f_4) \leftarrow r(X_1,X_2,X_3)$ (18)



The query of this example is:

$q(X,Y) :\leftarrow \mathit{patients}(X,W_1,W_2,\mathit{medicare}), \mathit{notes}(W_3,X,W_4,W_5), \mathit{prescription}(W_3,Y,W_6), \mathit{drugs}(Y,\mathit{no})$

For simplicity, we did not place the query in expanded form. If we had, at the end we would
have had to change the new variable back to the constant which it replaced. The derivation
is shown in Figure 3. Two of the Clark completion resource rules are used. The rewritten
query at the bottom of the tree

$\leftarrow \mathit{patients}(X,W_1,W_2,\mathit{medicare}), r(X,W_4,Y), \mathit{drugs}(Y,\mathit{no})$

consists of the resource predicate (replacing two extensional predicates) and two extensional
predicates. This is an example of a partial folding.

Query: $\leftarrow \mathit{patients}(X,W_1,W_2,\mathit{medicare}), \mathit{notes}(W_3,X,W_4,W_5), \mathit{prescription}(W_3,Y,W_6), \mathit{drugs}(Y,\mathit{no})$
  CCrr1: $\mathit{notes}(f_1,X_1,X_2,f_2) \leftarrow r(X_1,X_2,X_3)$, $\{W_3/f_1, X_1/X, X_2/W_4, W_5/f_2\}$
$\leftarrow \mathit{patients}(X,W_1,W_2,\mathit{medicare}), r(X,W_4,X_3), \mathit{prescription}(f_1,Y,W_6), \mathit{drugs}(Y,\mathit{no})$
  CCrr2: $\mathit{prescription}(f_1,X_3,f_3) \leftarrow r(X_1,X_2,X_3)$, $\{X_3/Y, W_6/f_3\}$
$\leftarrow \mathit{patients}(X,W_1,W_2,\mathit{medicare}), r(X,W_4,Y), r(X_1,X_2,Y), \mathit{drugs}(Y,\mathit{no})$
  factor
$\leftarrow \mathit{patients}(X,W_1,W_2,\mathit{medicare}), r(X,W_4,Y), \mathit{drugs}(Y,\mathit{no})$

Figure 3: Example 6 from [Qia96]

5 Handling Functional and Inclusion Dependencies

This section illustrates the use of our algorithm in the special cases where the integrity constraints
are functional and inclusion dependencies. As explained earlier, the presence of functional
dependencies means that Case 3 for ICs applies (see page 10). [DGQ96] gives algorithms in the presence
of functional dependencies. Their basic idea is to use a functional dependency to decompose a
relation into several relations, using a lossless join decomposition, and then apply the standard
folding algorithm. We show that such a decomposition is not necessary. Instead of decomposing a
relation that contains a functional dependency, we generate additional Clark completion rules and
then apply our standard algorithm.

5.1 Example with Key Constraint

We consider how our algorithm applies to Example 3 of [DGQ96]. The EDB contains three
relations, again involving a patient record database as follows:

Patients(name, dob, insurance)

Procedure(patient_name, physician_name, procedure_name, time)

Insurer(company, address, phone)
Again, IDB = $\emptyset$. The ResDB consists of two relations Clinical_History and Billing which we
write as $r_1$ and $r_2$ with the following definitions:

$r_1(X_1,X_2,X_3,X_4) \leftarrow \mathit{patients}(X_1,X_2,U_1,U_2), \mathit{procedure}(X_1,U_3,X_3,X_4)$

$r_2(Y_1,Y_2,Y_3) \leftarrow \mathit{patients}(Y_1,V_1,Y_2,V_2), \mathit{insurer}(V_2,Y_3,V_3)$

The IC contains the key constraint patients : patient_id $\rightarrow$ clinic, dob, insurance, written in
Datalog as three clauses:

$X_2 = Y_2 \leftarrow \mathit{patients}(X_1,X_2,X_3,X_4), \mathit{patients}(X_1,Y_2,Y_3,Y_4)$
$X_3 = Y_3 \leftarrow \mathit{patients}(X_1,X_2,X_3,X_4), \mathit{patients}(X_1,Y_2,Y_3,Y_4)$
$X_4 = Y_4 \leftarrow \mathit{patients}(X_1,X_2,X_3,X_4), \mathit{patients}(X_1,Y_2,Y_3,Y_4)$

Preprocessing yields four Clark completion resource rules for $r_1$ and $r_2$ as follows:

CCrr1: $\mathit{patients}(X_1,X_2,f_1,f_2) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (19)
CCrr2: $\mathit{procedure}(X_1,f_3,X_3,X_4) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (20)
CCrr3: $\mathit{patients}(Y_1,g_1,Y_2,g_2) \leftarrow r_2(Y_1,Y_2,Y_3)$ (21)
CCrr4: $\mathit{insurer}(g_2,Y_3,g_3) \leftarrow r_2(Y_1,Y_2,Y_3)$ (22)
As we will show below, in the case of a key constraint it is sometimes possible to combine Clark
completion resource rules. In this particular case the rules (19) and (21) can be
combined to yield

CombinedCCrr: $\mathit{patients}(X_1,X_2,Y_2,f_2) \leftarrow r_1(X_1,X_2,X_3,X_4), r_2(X_1,Y_2,Y_3)$ (23)

The query of this example is:

$q(X,Y,Z) :\leftarrow \mathit{patients}(X,Y,Z,W)$

As shown in Figure 4, starting with the body of this query and using the combined Clark
Completion resource rule, the derivation takes one step to obtain

$\leftarrow r_1(X,Y,X_3,X_4), r_2(X,Z,Y_3)$

We note, however, that we could not answer the query

$q(X,Y,Z,W) :\leftarrow \mathit{patients}(X,Y,Z,W)$

this way because $W$ does not appear in the folded query.

Query: $\leftarrow \mathit{patients}(X,Y,Z,W)$
  CombinedCCrr: $\mathit{patients}(X_1,X_2,Y_2,f_2) \leftarrow r_1(X_1,X_2,X_3,X_4), r_2(X_1,Y_2,Y_3)$, $\{X_1/X, X_2/Y, Y_2/Z, W/f_2\}$
$\leftarrow r_1(X,Y,X_3,X_4), r_2(X,Z,Y_3)$

Figure 4: Key Constraint Example

5.2 Combining Clark Completion Resource Rules 

In the presence of functional dependencies it is possible under certain conditions to combine Clark 
completion resource rules for the same predicate in such a way that function symbols are replaced 
by variables. Using the combined rules may simplify the derivation of the folded query. In this 
subsection we deal with the special useful case of key constraints. 

We start by introducing notation involved in combining two Clark completion resource rules. The 
same basic method as described below will handle more than two rules. We assume that there exist 
two such rules of the form

$p(X_f) \leftarrow r_1(X)$ (24)
$p(Y_g) \leftarrow r_2(Y)$ (25)

where $p$ is an $n$-ary predicate, $X_f$ is an $n$-tuple all of whose variables are in $X$ and may contain
function symbols $f_i$, and $Y_g$ is an $n$-tuple all of whose variables are in $Y$ and may contain function
symbols $g_j$. For any tuple $U$ we write $U[i]$ for the $i$-th component of $U$. Define the $n$-ary tuple $Z$ by

$Z[i] = \begin{cases} Y_g[i] & \text{if } X_f[i] \text{ is a function symbol and } Y_g[i] \text{ is a variable} \\ X_f[i] & \text{otherwise} \end{cases}$

Also define the tuple $Y/(X_k)$ to have the same number of components as $Y$ and defined as

$Y/(X_k)[i] = \begin{cases} X[i] & \text{if } 1 \le i \le k \\ Y[i] & \text{if } k < i \end{cases}$

Proposition 1. For the two rules given in 24 and 25, if $X_f[i]$ and $Y_g[i]$ are variables for $1 \le i \le k$
and the first $k$ columns of $p$ form a key for $p$, then the combined Clark completion resource
rule written as

$p(Z) \leftarrow r_1(X), r_2(Y/(X_k))$ (26)

is also a valid rule.

Proof. By 24,

$p(X_f) \leftarrow r_1(X), r_2(Y/(X_k))$. (27)

By 25,

$p(Y_g/(X_k)) \leftarrow r_1(X), r_2(Y/(X_k))$. (28)

By the hypothesis that $X_f[i]$ and $Y_g[i]$ are variables for $1 \le i \le k$, we obtain $X_f[i] = Y_g/(X_k)[i]$
for $1 \le i \le k$. As the first $k$ columns of $p$ form a key, the corresponding elements
of $X_f$ and $Y_g/(X_k)$ must be equal. Since $Z$ contains the first $k$ columns of $X_f$ and the rest
of the columns are from $X_f$ or $Y_g$,

$p(Z) \leftarrow r_1(X), r_2(Y/(X_k))$ (29)

follows. □

Going back to the example of Section 4.1, CCrr1 and CCrr3 are two Clark completion resource
rules for the patients predicate. The first attribute is the key and both have a variable for the first
attribute: $X_1$ and $Y_1$. Now set $Y_1 = X_1$. This forces $g_1 = X_2$, $f_1 = Y_2$, and $g_2 = f_2$, and we obtain
the Combined CCrr.
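
The construction of a combined rule can be sketched in Python. This is an illustrative reading of Proposition 1, not the authors' code: terms are plain strings, uppercase strings are variables, anything else is treated as a function symbol, and the key variables of the second rule are renamed to those of the first rule by a substitution. The function name `combine_ccrr` and this encoding are assumptions.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def combine_ccrr(xf, x, yg, y, k):
    """Combine p(xf) <- r1(x) and p(yg) <- r2(y) when the first k columns of p are a key.

    Returns (z, x, y_sub), read as the rule p(z) <- r1(x), r2(y_sub).
    """
    if not all(is_var(xf[i]) and is_var(yg[i]) for i in range(k)):
        raise ValueError("key positions must hold variables in both rules")
    # Identify the key variables of the second rule with those of the first.
    sigma = {yg[i]: xf[i] for i in range(k)}
    y_sub = tuple(sigma.get(t, t) for t in y)
    yg_sub = tuple(sigma.get(t, t) for t in yg)
    # Prefer a variable from the second rule where the first has a function symbol.
    z = tuple(g if (not is_var(f) and is_var(g)) else f
              for f, g in zip(xf, yg_sub))
    return z, tuple(x), y_sub


# Rules (19) and (21) for the patients predicate, key = first column.
combined = combine_ccrr(
    xf=("X1", "X2", "f1", "f2"), x=("X1", "X2", "X3", "X4"),
    yg=("Y1", "g1", "Y2", "g2"), y=("Y1", "Y2", "Y3"),
    k=1,
)
```

Applied to rules (19) and (21), the sketch returns the head arguments and body arguments of the combined rule (23).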



5.3 Using the Lossless Join Decomposition Property for Key Constraints 

In [DGQ96] the query in 4.1 is solved using a property of key constraints. Namely, the key
constraint patients : patient_id $\rightarrow$ clinic, dob, insurance implies that the decomposition of the
relation patients(patient_id, clinic, dob, insurance) to the three relations pat1(patient_id, clinic),
pat2(patient_id, dob), pat3(patient_id, insurance) is a lossless join decomposition. Therefore we
can deal with the three relations pat1, pat2, pat3 instead of patients.

Now, the ResDB relations are defined as follows:

$r_1(X_1,X_2,X_3,X_4) \leftarrow \mathit{pat1}(X_1,X_2), \mathit{pat2}(X_1,U_1), \mathit{pat3}(X_1,U_2), \mathit{procedure}(X_1,U_3,X_3,X_4)$ (30)

$r_2(Y_1,Y_2,Y_3) \leftarrow \mathit{pat1}(Y_1,V_1), \mathit{pat2}(Y_1,Y_2), \mathit{pat3}(Y_1,V_2), \mathit{insurer}(V_2,Y_3,V_3)$ (31)

and there are eight Clark Completion resource rules:

CCrr1: $\mathit{pat1}(X_1,X_2) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (32)
CCrr2: $\mathit{pat2}(X_1,f_1) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (33)
CCrr3: $\mathit{pat3}(X_1,f_2) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (34)
CCrr4: $\mathit{procedure}(X_1,f_3,X_3,X_4) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (35)
CCrr5: $\mathit{pat1}(Y_1,g_1) \leftarrow r_2(Y_1,Y_2,Y_3)$ (36)
CCrr6: $\mathit{pat2}(Y_1,Y_2) \leftarrow r_2(Y_1,Y_2,Y_3)$ (37)
CCrr7: $\mathit{pat3}(Y_1,g_2) \leftarrow r_2(Y_1,Y_2,Y_3)$ (38)
CCrr8: $\mathit{insurer}(g_2,Y_3,g_3) \leftarrow r_2(Y_1,Y_2,Y_3)$ (39)

The query

$q(X,Y,Z) :\leftarrow \mathit{patients}(X,Y,Z,W)$ (40)

is rewritten as

$q(X,Y,Z) :\leftarrow \mathit{pat1}(X,Y), \mathit{pat2}(X,Z), \mathit{pat3}(X,W)$ (41)

but $\mathit{pat3}(X,W)$ is superfluous, because the query does not contain $W$; hence we obtain

$q(X,Y,Z) :\leftarrow \mathit{pat1}(X,Y), \mathit{pat2}(X,Z)$ (42)

The derivation is shown in Figure 5. Two of the eight Clark Completion resource rules are used.
The final query is the same as the one we obtained using the Combined Clark Completion rule.
Again, if the query were

$q(X,Y,Z,W) :\leftarrow \mathit{patients}(X,Y,Z,W)$ (43)

then we could not obtain a complete folding because after applying CCrr1 and CCrr6 we would
be left with

$\leftarrow r_1(X,Y,X_3,X_4), r_2(X,Z,Y_3), \mathit{pat3}(X,W)$ (44)

and now applying either CCrr3 or CCrr7 would lead to a function symbol for one of the
variables in $r_1$ or $r_2$.
Query: $\leftarrow \mathit{pat1}(X,Y), \mathit{pat2}(X,Z)$
  CCrr1: $\mathit{pat1}(X_1,X_2) \leftarrow r_1(X_1,X_2,X_3,X_4)$, $\{X_1/X, X_2/Y\}$
$\leftarrow r_1(X,Y,X_3,X_4), \mathit{pat2}(X,Z)$
  CCrr6: $\mathit{pat2}(Y_1,Y_2) \leftarrow r_2(Y_1,Y_2,Y_3)$, $\{Y_1/X, Y_2/Z\}$
$\leftarrow r_1(X,Y,X_3,X_4), r_2(X,Z,Y_3)$

Figure 5: Key Constraint Example Using Relation Decomposition



5.4 Decomposition Cannot Always Handle Functional Dependencies 

Our next example also contains a functional dependency, but in this case the functional dependency
cannot be handled by a decomposition. However, our algorithm can be used to obtain a folding.
The EDB consists of two relations

$p_1(X,Y)$ and $p_2(X,Y)$,

the IDB is empty, and the IC contains the key constraint $p_2 : X \rightarrow Y$, written as:

$Y = Y' \leftarrow p_2(X',Y), p_2(X',Y')$.

Note that neither EDB relation can be decomposed.

The ResDB consists of two relations:

$r_1(X,Z) \leftarrow p_1(X,W), p_2(Z,W)$ (45)
$r_2(X,Y) \leftarrow p_2(X,Y)$ (46)

Preprocessing yields two Clark completion resource rules for $r_1$ and one Clark completion resource
rule for $r_2$ as follows:

$p_1(X,f) \leftarrow r_1(X,Z)$ (47)
$p_2(Z,f) \leftarrow r_1(X,Z)$ (48)
$p_2(X,Y) \leftarrow r_2(X,Y)$ (49)



Consider the query

$q(X) :\leftarrow p_1(X,c)$ (50)

The solution is given in Figure 6. We start by expanding the query, that is, rewriting the query to

$q(X) :\leftarrow p_1(X,W), W = c$ (51)

in order to take the constant out of the predicate, allowing for a substitution later. All three Clark
completion resource rules are used in the derivation to obtain

$\leftarrow r_1(X,Z), r_2(Z,c)$ (52)

which is a complete folding.

Query: $\leftarrow p_1(X,W), W = c$
  IC: $Y = Y' \leftarrow p_2(X',Y), p_2(X',Y')$, $\{Y/W, Y'/c\}$
$\leftarrow p_1(X,W), p_2(X',W), p_2(X',c)$
  CCrr1: $p_1(X,f) \leftarrow r_1(X,Z)$
$\leftarrow r_1(X,Z), p_2(X',f), p_2(X',c)$
  CCrr2: $p_2(Z,f) \leftarrow r_1(X,Z)$, $\{X'/Z\}$
$\leftarrow r_1(X,Z), r_1(X,Z), p_2(Z,c)$
  factor
$\leftarrow r_1(X,Z), p_2(Z,c)$
  CCrr3: $p_2(X,Y) \leftarrow r_2(X,Y)$, $\{X/Z, Y/c\}$
$\leftarrow r_1(X,Z), r_2(Z,c)$

Figure 6: Functional Dependency That Cannot Be Handled by Decomposition
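
The role of the key constraint in this folding can be seen on small instances. The Python sketch below evaluates the query $q(X) :\leftarrow p_1(X,c)$ directly and through the folding $\leftarrow r_1(X,Z), r_2(Z,c)$. The data and function names are made up for illustration: one instance of $p_2$ satisfies the key constraint $X \rightarrow Y$ and the other violates it.

```python
def fold_answers(p1, p2, c):
    """Evaluate the folded query <- r1(X,Z), r2(Z,c) on materialized resources."""
    # r1(X,Z) <- p1(X,W), p2(Z,W)
    r1 = {(x, z) for (x, w) in p1 for (z, w2) in p2 if w == w2}
    # r2(X,Y) <- p2(X,Y)
    r2 = set(p2)
    return {x for (x, z) in r1 if (z, c) in r2}


def direct_answers(p1, c):
    """Evaluate the original query q(X) :<- p1(X,c) on the EDB."""
    return {x for (x, w) in p1 if w == c}


p1 = {("a", "c"), ("b", "d"), ("e", "c")}
p2_key = {("k", "c"), ("m", "d")}      # first column is a key
p2_no_key = {("k", "c"), ("k", "d")}   # violates the key constraint

sound = fold_answers(p1, p2_key, "c")
unsound = fold_answers(p1, p2_no_key, "c")
direct = direct_answers(p1, "c")
```

With the key constraint in force the folded answers are contained in the direct answers. On the violating instance the folding returns the extra value `"b"`, which shows that the integrity constraint is what justifies the derivation.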



We note a generalization of the above result:
Suppose the EDB consists of $n$ relations
$p_1(X,Y_2,\ldots,Y_n), p_2(X,Y_2), \ldots, p_n(X,Y_n)$ with the IC containing the $n-1$ key constraints
$p_i : X \rightarrow Y_i$ for $i = 2, \ldots, n$ and the ResDB consisting of the $n$ relations $r_1, r_2, \ldots, r_n$, defined as

$r_1(X,Z) \leftarrow p_1(X,W_2,\ldots,W_n), p_2(Z,W_2), \ldots, p_n(Z,W_n)$ (53)
$r_2(X,Y) \leftarrow p_2(X,Y) \ldots$ (54)
$\vdots$
$r_n(X,Y) \leftarrow p_n(X,Y) \ldots$ (55)

where the $\ldots$ indicate the possibility of additional predicates.
Then the query

$q(X) :\leftarrow p_1(X,c_2,\ldots,c_n)$ (56)

can be folded as

$\leftarrow r_1(X,Z), r_2(Z,c_2), \ldots, r_n(Z,c_n)$ (57)



5.5 Functional Dependencies and Recursion 

The following example, discussed in [DGL00], illustrates how functional dependencies may introduce
recursion. We now consider how this is handled in our approach and show that, although the non
built-in predicates are not recursive, recursion is introduced by the recursive transitivity rule of
equality.

Example 4. EDB: schedule(Airline, Flight_No, Date, Pilot, Aircraft)
IDB: $\emptyset$
IC:

$A_1 = A_2 \leftarrow s(A_1,N_1,D_1,P_1,C_1), s(A_2,N_2,D_2,P_1,C_2)$
(i.e., the functional dependency Pilot $\rightarrow$ Airline)
$A_1 = A_2 \leftarrow s(A_1,N_1,D_1,P_1,C_1), s(A_2,N_2,D_2,P_2,C_1)$
(i.e., the functional dependency Aircraft $\rightarrow$ Airline)
ResDB: $r(D,P,C) \leftarrow s(A,N,D,P,C)$
Query: $q(P) :\leftarrow s(A,N,D,\mathit{mike},C), s(A,N',D',P,C')$

CCrr: $s(f(D,P,C), g(D,P,C), D, P, C) \leftarrow r(D,P,C)$

In [DGL00], they discuss how they obtain an infinite set of folded queries, one for each $n$, of the form:

$q_n(P) \leftarrow r(D_1,\mathit{mike},C_1), r(D_2,P_2,C_1), r(D_3,P_2,C_2), r(D_4,P_3,C_2), \ldots,$
$r(D_{2n-2},P_n,C_{n-1}), r(D_{2n-1},P_n,C_n), r(D_{2n},P,C_n)$

Using our approach we start with the expanded clause:

$\leftarrow s(A,N,D,P',C), s(A',N',D',P,C'), A = A', P' = \mathit{mike}$

By applying the functional dependency $C \rightarrow A$, factoring the resolvent clause twice, and
applying the CCrr twice, we obtain the clause

$\leftarrow r(D,P',C), r(D',P,C), P' = \mathit{mike}$

which is equivalent to $q_1(P)$. Although we do not provide the details here, in order to
obtain the other $q_i(P)$, $i > 1$, we need to use the recursive transitivity rule of equality as well as the two
integrity constraints several times and factoring several times.

5.6 Inclusion Dependency Example 

The last example of this section illustrates the use of an inclusion dependency. This example is
taken from Example 4.3 of [Gry98] and contains an example given earlier with some modifications.
The EDB contains four relations:

Patient(name, dob, address, insurer)

Procedure(patient_name, physician_name, procedure_name, time)

Insurer(company, address, phone)

Event(event_name, description, patient_name, location)

The IDB is empty.

The IC contains a single inclusion dependency:

Procedure(procedure_name, patient_name) $\subseteq$ Event(event_name, patient_name)

written in Datalog as

$\mathit{event}(X_3', f, X_1', g) \leftarrow \mathit{procedure}(X_1', X_2', X_3', X_4')$.

As before, the ResDB consists of two relations Clinical_History and Billing, written as $r_1$ and
$r_2$ with the following definitions:

$r_1(X_1,X_2,X_3,X_4) \leftarrow \mathit{patient}(X_1,X_2,U_1,U_2), \mathit{procedure}(X_1,U_3,X_3,X_4)$ (58)

$r_2(Y_1,Y_2,Y_3) \leftarrow \mathit{patient}(Y_1,V_1,Y_2,V_2), \mathit{insurer}(V_2,Y_3,V_3)$ (59)

Preprocessing yields the same Clark completion resource rules as before:

CCrr1: $\mathit{patient}(X_1,X_2,f_1,f_2) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (60)
CCrr2: $\mathit{procedure}(X_1,f_3,X_3,X_4) \leftarrow r_1(X_1,X_2,X_3,X_4)$ (61)
CCrr3: $\mathit{patient}(Y_1,g_1,Y_2,g_2) \leftarrow r_2(Y_1,Y_2,Y_3)$ (62)
CCrr4: $\mathit{insurer}(g_2,Y_3,g_3) \leftarrow r_2(Y_1,Y_2,Y_3)$ (63)

The query asks for the names of events recorded for patients born before 1930:

$q(X_3) :\leftarrow \mathit{patient}(X_1,X_2,Y_1,Y_2), \mathit{event}(X_3,Y_3,X_1,Y_4), X_2 < 1930$ (64)

The derivation is shown in Figure 7. Using the inclusion dependency and two Clark completion
resource rules we obtain the answer as

$\leftarrow r_1(X_1,X_2,X_3,X_4), X_2 < 1930$ (65)
Query: $\leftarrow \mathit{patient}(X_1,X_2,Y_1,Y_2), \mathit{event}(X_3,Y_3,X_1,Y_4), X_2 < 1930$
  IC: $\mathit{event}(X_3', f, X_1', g) \leftarrow \mathit{procedure}(X_1', X_2', X_3', X_4')$, $\{X_3'/X_3, Y_3/f, X_1'/X_1, Y_4/g\}$
$\leftarrow \mathit{patient}(X_1,X_2,Y_1,Y_2), \mathit{procedure}(X_1,X_2',X_3,X_4'), X_2 < 1930$
  CCrr1: $\mathit{patient}(X_1,X_2,f_1,f_2) \leftarrow r_1(X_1,X_2,X_3,X_4)$
$\leftarrow r_1(X_1,X_2,X_3,X_4), \mathit{procedure}(X_1,X_2',X_3,X_4'), X_2 < 1930$
  CCrr2: $\mathit{procedure}(X_1,f_3,X_3,X_4) \leftarrow r_1(X_1,X_2,X_3,X_4)$, $\{X_2'/f_3, X_4'/X_4\}$
$\leftarrow r_1(X_1,X_2,X_3,X_4), r_1(X_1,X_2,X_3,X_4), X_2 < 1930$
  factor
$\leftarrow r_1(X_1,X_2,X_3,X_4), X_2 < 1930$

Figure 7: Inclusion Dependency Example
6 Multiple Definitions for Resources

In this section we consider the case where there are resources that were developed from several
definitions or queries. This is in contrast to work in the previous section where it was assumed
that a resource was constructed from a single conjunctive view. We also allow queries that are in
disjunctive normal form, that is, the queries involve disjunctions of conjunctions. Our assumptions
for the database are different from the ones given at the beginning of Section 4. We list our
assumptions here.

1. No formula contains negation. 

2. Each IDB predicate may be defined by multiple safe, conjunctive, non-recursive function-free 
Horn rules. 

3. Each ResDB predicate is defined by a set of safe conjunctive function-free Horn formulas on 
EDB and/or IDB predicates. 

4. Each IC clause is a safe function-free formula of the form G <— F where G is a disjunction 
of zero or more EDB predicates and F is a conjunction of EDB predicates. 

5. Each query has the form $q \leftarrow G$ where $G$ is in disjunctive normal form on EDB and
IDB predicates.

6. The database includes axioms for built-in predicates as needed. 

Resource definitions may contain constants. We assume that the resource rules are expanded, and 
rectified (see page 6) so that a single resource that has multiple definitions is defined by the same 
variables in each definition. 

We first provide a simple example that illustrates what has to be done to handle such cases.

Example 5. (Multi-Resource Example) EDB: $p_1(X)$, $p_2(X)$, $p_3(X)$.
IDB: $\emptyset$.

IC: $p_2(X) \lor p_3(X) \leftarrow p_1(X)$. (66)

The integrity constraint is a non-Horn clause and states that whenever $p_1(a)$ is in the database,
for some constant $a$, then either $p_2(a)$ or $p_3(a)$ or both are in the database. We also have the
following resource definitions:
ResDB:

(67)
(68)

(69)

CCrr1: $p_1(X) \lor p_3(X) \leftarrow r(X)$, (70)

and

CCrr2: $p_2(X) \lor p_1(X) \leftarrow r(X)$. (71)

Figure 8: Deriving Resource Queries

Note that the use of more than one conjunctive definition of a resource leads to non-Horn clauses.
Such clauses are outside of Datalog. Hence to work with such clauses, we need, in general,
Disjunctive Datalog, that is, Datalog$^{\lor}$.

Next, consider the query: 

$q(X) :\leftarrow p_1(X) \lor p_3(X)$. (72)

Now we have to determine if we can derive answers to the query from the EDB, the IDB, and
the CCrr (that is, rule 70 and rule 71). Figure 8 shows the steps required to achieve this result.
Because the query is a disjunction, $\leftarrow p_1(X) \lor p_3(X)$, the derivation can be split into two clauses:
$\leftarrow p_1(X)$ and $\leftarrow p_3(X)$. The derivation terminates with a clause that contains only the resource
predicate. We will show that all answers obtained by querying the resource will yield correct
answers to the query $q(X)$.



Example 6. (Example 5 Continued) Consider the same example, where the query was:

$q(X) :\leftarrow p_1(X)$. (73)

Figure 8 applies without the right branch $\leftarrow p_3(X)$. We obtain as bottom clause
$p_3(X) \leftarrow r(X)$. If written with an empty head, we obtain $\leftarrow r(X), \neg p_3(X)$. This is a
query that contains logical negation. We cannot replace this by default negation since default
negation does not imply logical negation. We can see the problem by assuming that we have
$r(c), p_3(a), p_3(b)$. $p_3(c)$ may either be true or false, but default negation assumes its falsity.
We might have that $r(c), \neg p_3(c)$ is not true, while $r(c), \mathit{not}\ p_3(c)$ is true. Hence we could obtain a
non-sound answer $q(c)$ by using default negation. However, if one knew that the $p_3$ predicate
were complete (by the Closed World Assumption) one could write an axiom

$\neg p_3 \leftarrow \mathit{not}\ p_3$ (74)

in which case another resolution step would be applied to obtain

$\leftarrow r(X), \mathit{not}\ p_3(X)$ (75)

The above query is a partial folding. The negation of the atom $p_3(X)$ does not lead to complications
since the resource $r(X)$ makes the formula safe.
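
The effect of default negation can be illustrated on the facts used above, $r(c)$, $p_3(a)$, $p_3(b)$. The Python sketch below evaluates $\leftarrow r(X), \mathit{not}\ p_3(X)$ by set difference; the function name is hypothetical, and the result is a sound answer to the original query only under the assumption that the stored $p_3$ relation is complete (the Closed World Assumption).

```python
def fold_with_default_negation(r, p3_stored):
    """Evaluate <- r(X), not p3(X), reading 'not' as absence from the stored p3.

    Assumption: p3_stored lists every true p3 fact (CWA). Without that
    assumption the result may contain unsound answers.
    """
    return {x for x in r if x not in p3_stored}


r = {"c"}
p3_stored = {"a", "b"}
answers = fold_with_default_negation(r, p3_stored)
```

Here `answers` contains `"c"` because `"c"` is absent from the stored $p_3$ facts. If $p_3(c)$ were in fact true but unrecorded, this answer would be unsound, which is why the axiom (74) is needed before default negation may be used.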

The folding algorithm we now present uses set-of-support as its inference method. In this inference,
there are two types of clauses: one set, referred to as $T$, is given support. This set consists of the
conjunctions that are part of the query. The second set consists of a satisfiable set of clauses, the
remaining clauses in the set $C$, which we may refer to as set $S$. A set of support resolution is a
resolution of two clauses that are not both from $S - T$ ([CL73]).

Before the algorithm commences we assume that there is a test to determine which of the three
cases (see page 10) applies to ICs. In Case 1, nothing has to be done. In Case 2, we assume that
the set-of-support derivation is modified to include a check to determine if the clause, $L$, that has
been generated, satisfies the bounding condition, and if it does, then backtracking occurs. In Case
3, a depth bound $k$ is specified and if the depth is reached, backtracking occurs. We also assume
that if there are built-in predicates such as $=$, $\neq$, $>$, then the input clauses are placed in expanded
form. If there are no built-in predicates, it is unnecessary to do the expansion.

Folding Algorithm 2 (Finding Multiple Foldings) 

Input: $C$: the set of clauses in the EDB, IDB, IC, and CCrr, and

the query, $q(X) :\leftarrow G(Y)$, where $X \subseteq Y$, and $G$ is in disjunctive normal form.

Output: A proof tree starting with the query.
begin

Split the query, $\leftarrow G$, into a set of clauses (each of which has support). Find a proof tree
using set-of-support resolution that results in leaf nodes of the form $L$ that contain the
query variables and no function symbols.

end 

We consider four cases for the clauses that are leaf nodes in the proof tree. 

1. The clause has an empty head and a conjunction of resource and built-in predicates in the 
body. 

2. The clause has an empty head and a conjunction of resource, EDB and built-in predicates 
in the body. 

3. The clause has no resource predicates. 

4. The clause has a non-empty head. 






Theorem 6.1. Consider a clause that is a leaf node of the proof tree constructed by Folding
Algorithm 2. In Cases 1 and 2, the query can be answered using the clause and every
answer obtained that way is sound. These cases are instances of complete and partial folding
respectively. In Case 3, the clause is not a folding. In Case 4, the clause is not a folding, but
if the CWA can be applied to all the predicates in the head, then the atoms in the head can
be moved to the body using default negation and Case 2 or Case 3 becomes applicable.

Proof. For a clause of Case 1 or 2, the proof of Theorem 4.1 applies. In Case 3, there is no folding.
In Case 4, without the CWA, the clause is not a query (since it does not have an empty
head). The disjunction of atoms in the head of such a clause can be moved to the body and
negated (for example, by $\neg p$). Since the CWA applies, the axioms $\neg p \leftarrow \mathit{not}\ p$ can be added
to the set of clauses $C$. These axioms may be used to eliminate the logically negated atoms,
which are replaced by default negated atoms. When this is done, this case reduces to Case 2 or
Case 3. □
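
The four-way case analysis of leaf clauses can be sketched as a small Python function. This is an illustration only: a clause is encoded as a list of head atoms and a list of body atoms, each atom being a (predicate, arguments) pair, and the function name `classify_leaf` and the encoding are assumptions. A body that mixes resource predicates with anything other than built-in predicates is reported as Case 2.

```python
def classify_leaf(head, body, resources, builtins):
    """Return the case number (1-4) of a leaf clause head <- body."""
    if head:
        return 4                      # non-empty head
    preds = {pred for pred, _args in body}
    if not preds & resources:
        return 3                      # no resource predicates
    if preds <= resources | builtins:
        return 1                      # only resource and built-in predicates
    return 2                          # resource plus EDB (and built-in) predicates


resources = {"r1", "r2"}
builtins = {"=", "!=", "<", ">"}

# Leaf clauses shaped like those discussed for Example 7.
leaves = [
    ([], [("r1", ("X",))]),
    ([], [("r1", ("X",)), ("p3", ("X", "V"))]),
    ([], [("p3", ("X", "Z")), ("p4", ("X", "Z"))]),
    ([("p7", ("X", "Y"))], [("r2", ("X", "Y"))]),
]
cases = [classify_leaf(h, b, resources, builtins) for h, b in leaves]
```

The four sample leaves fall into Cases 1, 2, 3 and 4 respectively, that is, a complete folding, a partial folding, no folding, and a clause that becomes usable only under the CWA.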

We illustrate the theorem with an example: 

Example 7. Consider the EDB with the relations pi{Xi,Yi), . . . ,^7(^7, Yj) Assume there are no 
IDB predicates and no ICs. Let there be the following resource rules: 
r,{X)^pi{X,Z),p2{X,Z),Z^a 
r2{X,Y)^P5{X,Z),peiZ,Y) 
r2iX,Y) ^ priX,Y) 

The Clark Completion resource rules are: 

CCrrl:pi{X,f)^n{X) (76) 

CCrr2:p2iX,f)^ri{X) (77) 

CCrrS :f^a^ri{X) (78) 

CCrr4 : p^iX, /) V priX, Y) ^ r2{X, Y) (79) 

CCrr5 : peiX, f) V pt{X, Y) ^ r2(X, Y) (80) 



Let the query be given by:

$q(X) \leftarrow (p_1(X, Z) \wedge ((p_2(X, Z) \wedge Z \neq a) \vee p_3(X, V))) \vee (p_3(X, Z) \wedge p_4(X, Z)) \vee (p_5(X, V) \wedge p_6(V, Y))$

We construct the proof tree in Figure 9 after converting the query to disjunctive normal form.
There are four leaf nodes in Figure 9. The leftmost leaf node has only resource predicates,
and hence, it can be used to obtain correct answers. The second leftmost leaf node has both
a resource predicate and an extensional predicate. Hence, it is a partial folding and if the
database and the resource predicate were used, correct answers would be found. The third
leftmost leaf provides nothing with respect to resources. The final leaf node has something
in the head of the clause and provides no useful information (without the Closed World
Assumption on $p_7$). If the CWA applies to $p_7$ then we obtain a partial folding:
$\leftarrow r_2(X, Y),\ \mathit{not}\ p_7(X, Y)$.

In the case of the query q{X,Y, Z,W) given at the end of Section 4.1, no resolution steps are 
possible, hence Case 3 applies. 

[Figure 9: Four Cases of Theorem 4.1 Illustrated. Proof tree for the query of Example 7 in
disjunctive normal form. Its four branches use CCrr1 to CCrr5 and factoring, and end in the leaves
$\leftarrow r_1(X)$; $\leftarrow r_1(X),\ p_3(X, V)$; $\leftarrow p_3(X, Z),\ p_4(X, Z)$; and
$p_7(X, Y) \leftarrow r_2(X, Y)$.]

Answering the query with resource rules as described in the above theorem does not mean that
all answers to the original query have been found. We want to determine when using resource
predicates (or partially using resource predicates) will provide all the answers, that is, if the method
is complete. This assumes that the database from which the resources were constructed has not
been updated.

When we do not have completeness using resource predicates we may still wish to find all answers.
This may be done as follows. Let the answers be given by the formula defining $Q_{resource}$, and let
the query be $Q$. Then the remaining answers may be found by using the query $Q - Q_{resource}$. In
some cases this may be simpler than trying to find all answers to $Q$. For example, if the query $Q$
is given by $Q : (p_4 \wedge p_5) \vee (p_1 \wedge p_2) \vee (p_2 \wedge p_3)$, with
$r_1 : p_1 \wedge p_2$ and $r_2 : p_2 \wedge p_3$, then the remaining
answers can be expressed by $p_4 \wedge p_5$, and hence it is easier in this case to find the complete set of
answers to the query by using this subformula together with the answers found by $Q_{resource}$ than
having to answer the original query, $Q$.
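The example above can be sketched in a few lines. This is only an illustration of the special case in which a disjunct of the query coincides exactly with the body of a resource definition; the function name `split_query` and the encoding of a DNF formula as a set of frozensets of predicate names are assumptions.

```python
# Q in disjunctive normal form: each frozenset is one conjunction.
Q = {
    frozenset({"p4", "p5"}),
    frozenset({"p1", "p2"}),
    frozenset({"p2", "p3"}),
}
# Bodies of the resource definitions r1 and r2.
RESOURCES = {
    "r1": frozenset({"p1", "p2"}),
    "r2": frozenset({"p2", "p3"}),
}


def split_query(query, resources):
    """Split a DNF query into the disjuncts covered by a resource and the rest."""
    bodies = set(resources.values())
    covered = {d for d in query if d in bodies}
    return covered, query - covered


covered, remaining = split_query(Q, RESOURCES)
print(sorted(sorted(d) for d in remaining))
```

Here `remaining` holds the single conjunction of p4 and p5, which is the subformula that still has to be evaluated against the database.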

In the following algorithm, where we test for completeness, we have to consider two cases. In the
first case there are no built-in predicates. In the second case, we allow built-in predicates, and
we place a bound on the depth of the proof tree. The algorithm is based on the subsumption
algorithm. In the subsumption algorithm two clauses, C and D, are given and the algorithm checks
whether C subsumes D, which means that there is a substitution $\theta$ such that $C\theta$ is a subset
of D, where the clauses C and D are considered as sets of literals. In our completeness test, D
corresponds to the original query and C is the union of the leaf nodes. What we are trying to
show is that the original query is implied by the union of the leaf nodes using the IDB, IC, and
ResDB.
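For readers unfamiliar with subsumption, the check that a substitution $\theta$ with $C\theta \subseteq D$ exists can be sketched as a small backtracking search. This is a generic illustration of the textbook notion, not the completeness test itself; the encoding of a literal as `(sign, predicate, args)` and the convention that variables start with an upper-case letter are assumptions.

```python
def is_var(term):
    """Variables of C start with an upper-case letter (assumed convention)."""
    return term[0].isupper()


def match(lit_c, lit_d, theta):
    """Extend theta so that lit_c under theta equals lit_d, or return None."""
    sign_c, pred_c, args_c = lit_c
    sign_d, pred_d, args_d = lit_d
    if (sign_c, pred_c, len(args_c)) != (sign_d, pred_d, len(args_d)):
        return None
    theta = dict(theta)
    for a, b in zip(args_c, args_d):
        if is_var(a):
            if theta.setdefault(a, b) != b:   # conflicting binding
                return None
        elif a != b:                          # constants must agree
            return None
    return theta


def subsumes(c, d, theta=None):
    """True if some substitution maps every literal of C onto a literal of D."""
    theta = {} if theta is None else theta
    if not c:
        return True
    first, rest = c[0], c[1:]
    for lit in d:
        extended = match(first, lit, theta)
        if extended is not None and subsumes(rest, d, extended):
            return True
    return False


C = [(True, "p", ("X", "Y")), (True, "q", ("Y",))]
D = [(True, "p", ("a", "b")), (True, "q", ("b",)), (True, "r", ("c",))]
print(subsumes(C, D))                              # X -> a, Y -> b
print(subsumes([(True, "p", ("X", "X"))], D))      # no consistent binding
```

The terms of D are treated as constants, as is usual for subsumption; only the variables of C are bound.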

Completeness Test Algorithm 

Input: The IDB, IC, ResDB, the non-negated form of the query, where each variable is replaced
by a unique new constant, and the leaf nodes of Cases 1 and 2 of Folding Algorithm 2. If
there are built-in predicates, the appropriate axioms, such as equality, are also included as
well as a bound on the depth of the proof tree.

begin

We give support to the clauses of the leaf nodes obtained in Algorithm 6, and apply set-of-support
resolution on the input clauses. If we obtain the null clause we terminate. If we allow
built-in predicates, when we reach the depth bound, we terminate.

end 

In the following theorem we show when completeness is obtained. 

Theorem 6.2. If the Completeness Test Algorithm yields the null clause, then the set of answers
obtained by solving the folded queries is complete. If the Completeness Test Algorithm ends
without reaching the null clause, then the set of answers obtained by solving the folded queries
may not be complete.

Proof. Suppose that the null clause has been found. Let $a$ be a solution to $q(X)$. Writing
$q(X) \leftarrow G(Y)$, we obtain $DB \models G(b)$, where $b[X] = a$. That is, the constants obtained
by answering the query only return that part of the tuple required by the variables in $X$.
The projection of $b$ on the variables in $X$ yields $a$. Suppose that the leaf nodes used in the
completeness test are $\leftarrow L_1(Z_1), \ldots, \leftarrow L_n(Z_n)$, where $X \subseteq Z_i$ for
$1 \leq i \leq n$. Thus,

$IDB \cup IC \cup ResDB \cup \{G(b)\} \cup \{\leftarrow L_1(Z_1)\} \cup \ldots \cup \{\leftarrow L_n(Z_n)\}$

form a contradiction. Hence,

$DB \models (IDB \cup IC \cup ResDB \cup \{G(b)\}) \rightarrow L_1(d_1) \vee \ldots \vee L_n(d_n)$.

Since $DB \models IDB \cup IC \cup ResDB \cup \{G(b)\}$,
$DB \models L_1(d_1) \vee \ldots \vee L_n(d_n)$, for some $d_i$, $1 \leq i \leq n$, where $d_i[X] = a$.
Hence, the solution $a$ to $q(X)$ is also obtained by one of the $L_i$, $1 \leq i \leq n$.

Suppose the null clause is not obtained. Proceeding as in the previous case, given a solution
$a$ to $q(X)$ we cannot prove that for some $d_i$, where $d_i[X] = a$,
$DB \models L_1(d_1) \vee \ldots \vee L_n(d_n)$.
Hence, it is possible that the solution $a$ to $q(X)$ may not be obtained by solving the folded
query. □

Note that if the null clause is found in Theorem 6.2 and all of the leaf nodes are of Case 1, we
have query completeness for a complete folding; otherwise we have query completeness for a partial
folding.

We illustrate this theorem by reconsidering the example at the beginning of this section. The
derivation is shown in Figure 10. The original query, $\leftarrow p_1(X) \vee p_3(X)$, is changed to its
non-negated form with the new constant $k$ substituted for the variable $X$ to become $p_1(k) \vee p_3(k)$.
We start with the rewritten query and the leaf node, $\leftarrow r(X)$, and eventually obtain the null clause.
This shows that the query using the resource predicate obtains all the answers to the original query.
Both factoring and ancestry-resolution are used.
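The refutation can be reproduced on the ground instances of the clauses at the constant $k$. The sketch below is a plain propositional set-of-support saturation, which is a simplification of the first-order procedure used in the paper: clauses are sets of literals (so factoring is implicit), and no ancestry bookkeeping is needed. The clause set is the one read off Figure 10; the function names and the `~` prefix for negated literals are assumptions.

```python
def negate(lit):
    """Complement of a propositional literal written as 'p' or '~p'."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def resolvents(c1, c2):
    """All non-tautological binary resolvents of two clauses."""
    out = set()
    for lit in c1:
        if negate(lit) in c2:
            r = (c1 - {lit}) | (c2 - {negate(lit)})
            if not any(negate(l) in r for l in r):
                out.add(frozenset(r))
    return out


def sos_refutes(support, others):
    """Set-of-support saturation; True if the null clause is derived."""
    support = {frozenset(c) for c in support}
    usable = {frozenset(c) for c in others} | support
    agenda = list(support)
    while agenda:
        given = agenda.pop()
        for other in list(usable):
            for r in resolvents(given, other):
                if not r:
                    return True
                if r not in usable:
                    usable.add(r)
                    agenda.append(r)
    return False


leaf = [{"~r"}]                       # <- r(k)
query = {"p1", "p3"}                  # p1(k) v p3(k)
res_db = [{"r", "~p1", "~p2"},        # r <- p1, p2
          {"r", "~p3"}]               # r <- p3
ic = {"p2", "p3", "~p1"}              # p2 v p3 <- p1

complete = sos_refutes(leaf, [query, ic] + res_db)
without_ic = sos_refutes(leaf, [query] + res_db)
print(complete, without_ic)
```

Dropping the integrity constraint makes the clause set satisfiable, so the null clause is no longer derived and completeness cannot be concluded.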

7 Negation 

In this section we deal with stratified databases. We allow default negation in the IDB, IC, ResDB,
and the query. Otherwise our assumptions on the database are the same as in Section 5. Problems
arise when we have just one resource rule that contains a conjunction of atoms, such as,

$r(X) \leftarrow p_1(X),\ p_2(X)$ (81)

If we have the negation of $p_1(X)$ or $p_2(X)$ in the query with another atom, there is no way to
resolve either of these atoms with the CCrr since the negated atoms do not appear on the left
hand side of any inverse rule. To handle this, we represent a rule for the negation of the resource
as,

$\mathit{not}\ r(X) \leftarrow \mathit{not}\ (p_1(X),\ p_2(X))$. (82)

(83)

(84)

$r'(X) \leftarrow p_1'(X)$, (85)

and

$r'(X) \leftarrow p_2'(X)$ (86)
[Figure 10: Use of Ancestry Resolution. Set-of-support refutation starting from the leaf node
$\leftarrow r(X)$, using the ResDB rules $r(X) \leftarrow p_1(X),\ p_2(X)$ and $r(X) \leftarrow p_3(X)$, the
IC $p_2(X) \vee p_3(X) \leftarrow p_1(X)$, and the query $p_1(k) \vee p_3(k)$. Factoring and two ancestry
steps with the top clause lead to the null clause.]

Now, we have two rules that define $r'(X)$. When we do the Clark completion of this atom, we
obtain a disjunctive inverse rule,

$p_1'(X) \vee p_2'(X) \leftarrow r'(X)$ (87)

We must rename the negated atom in the query in the same manner. Since we have a disjunctive
rule, as in Theorem 4.1, we may be in case (4) and not be able to compute an answer.

We have to first deal with how to handle stratified databases. The underlying idea as developed by 
[ABW88, VG88] is that given a stratum, and a negation of a predicate in the body of a rule in that 
stratum, the predicate that is negated must have been defined in an earlier stratum, and hence, 
the predicate may be calculated. Thus its negation can be obtained. An algorithm to make this 
explicit may be found in [Ull89] (Algorithm 3.6, Vol. 1). We adapt this algorithm by compiling
the IDB predicates to rules that involve EDB predicates. We give two versions of the compilation 
process. In the first version, which requires a restriction on the types of formulae allowed, we do a 
complete compilation, so that the final rules involve EDB predicates only. In the second version 
we do not compile all IDB predicates. 

First, in order to do a complete compilation, we must restrict the type of formula that we allow 
for defining IDB and ResDB predicates. Namely, in addition to safety, we require that for all
definitions the set of variables in the body be identical to the set of variables in the head. That is, 
there are no existential quantifiers in the right-hand side of the formula. We call such a formula 
extra safe. By the definition of safe formula, no variable may appear in the head that does not 
appear in the body. Now we show why there should not be any variable in the body that is not in 
the head. Consider the case where the IDB predicates p and t are defined in terms of the EDB 
predicates h, k, and s as follows: 

$p(X, Y) \leftarrow h(X, Z),\ k(Z, Y)$ (88)

$t(X, Y) \leftarrow s(X, Y),\ \mathit{not}\ p(X, Y)$ (89)

Note that the definition of $p$ is safe but contains a variable $Z$ in the body that is not in the head, and
hence is not extra safe. In our compilation process, to be described below, we replace $\mathit{not}\ p(X, Y)$
by the negation of the body of the definition of $p$, so we obtain the following two formulas as the
compiled definition of $t$:

$t(X, Y) \leftarrow s(X, Y),\ \mathit{not}\ h(X, Z)$ (90)

$t(X, Y) \leftarrow s(X, Y),\ \mathit{not}\ k(Z, Y)$ (91)

obtaining two unsafe formulas as well as losing the connection between h and k in t, namely, that
t is the join of h and k.
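A tiny instance makes the problem concrete. In the sketch below the relations are made-up data, and letting the unconstrained variable $Z$ of rules (90) and (91) range over the active domain is an assumption, since an unsafe rule has no agreed evaluation.

```python
# Hypothetical EDB instance.
h = {(1, 2)}
k = {(2, 3)}
s = {(1, 3)}
domain = {1, 2, 3}

# Intended meaning of (88) and (89): p joins h and k on Z, and t = s minus p.
p = {(x, y) for (x, z) in h for (z2, y) in k if z == z2}
t_intended = s - p

# Naive compilation, rules (90) and (91): Z is free in the negated atoms,
# so a single Z with "not h(X, Z)" or "not k(Z, Y)" is enough to derive t.
t_naive = {
    (x, y)
    for (x, y) in s
    for z in domain
    if (x, z) not in h or (z, y) not in k
}

print(t_intended, t_naive)
```

The tuple (1, 3) is in $p$, so it should be excluded from $t$, yet the naive rules derive it (for instance with $Z = 1$).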

Now we describe the compilation algorithm. In a stratified database, each stratum is numbered in 
an increasing fashion. Since there is no recursion in the IDB, we can modify the strata so that each 
stratum contains definitions for one predicate. For example, if s and t originally had definitions in 
the same stratum and s is the head of a rule that contains t in the body of the rule, we move the 
definitions for s to the next higher stratum and adjust other stratum values accordingly. Call the 
predicate with definitions in stratum $i$, $p_i$.

Compilation Algorithm 

Input: EDB (all predicates have stratum 0) and IDB (predicates with strata 1, . . . n). 
Output: The compiled IDB (each IDB predicate defined in terms of EDB predicates). 
begin

For stratum i = 1 To n Do (Note: EDB predicates are not compiled)

1. If $p_i$ has multiple definitions, i.e., there are multiple rules with $p_i$ as their head, replace
these rules with a single definition by taking the disjunction of the bodies of the multiple
definitions.

2. Substitute for each predicate $p_j$, in the definition of $p_i$, its compiled form.

3. Simplify the definition of $p_i$ by using De Morgan's rules and put in disjunctive normal
form.

end
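The three steps can be sketched at the propositional level, which is adequate when every rule is extra safe (the compiled body then has exactly the variables of the head, so arguments can be left out). This is an illustrative sketch only: the tuple encoding of formulas and the names `nnf`, `dnf`, `substitute` and `compile_idb` are assumptions, and no simplification of contradictory or redundant conjuncts is attempted. The input is the IDB of Example 8 below.

```python
def atom(name):
    return ("atom", name)


def nnf(f, negate=False):
    """Push negation inward with De Morgan's rules (step 3, first half)."""
    op = f[0]
    if op == "atom":
        return ("not", f) if negate else f
    if op == "not":
        return nnf(f[1], not negate)
    if negate:
        op = "or" if op == "and" else "and"
    return (op, nnf(f[1], negate), nnf(f[2], negate))


def dnf(f):
    """DNF of a negation-normal formula as a set of conjunctions of literals."""
    op = f[0]
    if op == "atom":
        return {frozenset([f[1]])}
    if op == "not":
        return {frozenset(["not " + f[1][1]])}
    if op == "or":
        return dnf(f[1]) | dnf(f[2])
    return {a | b for a in dnf(f[1]) for b in dnf(f[2])}


def substitute(f, compiled):
    """Replace every already-compiled predicate by its definition (step 2)."""
    if f[0] == "atom":
        return compiled.get(f[1], f)
    return (f[0],) + tuple(substitute(g, compiled) for g in f[1:])


def compile_idb(strata):
    """`strata`: list of (predicate, [body, ...]) in increasing stratum order."""
    compiled = {}
    for pred, bodies in strata:
        body = bodies[0]
        for other in bodies[1:]:          # step 1: disjunction of the bodies
            body = ("or", body, other)
        compiled[pred] = substitute(body, compiled)
    return {pred: dnf(nnf(f)) for pred, f in compiled.items()}


IDB = [
    ("e11", [("and", atom("e01"), atom("e02")),
             ("and", atom("e02"), ("not", atom("e03")))]),
    ("e12", [("and", atom("e04"), ("not", atom("e01")))]),
    ("e21", [("and", atom("e03"), ("not", atom("e12")))]),
]
compiled = compile_idb(IDB)
print(sorted(sorted(c) for c in compiled["e21"]))
```

For `e21` the result consists of the conjunctions {e03, not e04} and {e03, e01}, which is the disjunctive normal form of the compiled definition given for this predicate in Example 8.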



The following result shows that the Compilation Algorithm does not change the extra safeness of 
predicates. 

Theorem 7.1. Suppose that every predicate in IDB is defined by extra safe rules. Then the 
compiled definitions are also extra safe rules. 

Proof. We proceed by induction on the strata. A predicate at stratum 0 is extensional, and
the statement is vacuously true. Let $p_i$ be the IDB predicate at stratum $i$. By the inductive
hypothesis, all predicates $p_j$, $j < i$, in the body of a rule for $p_i$ have been compiled to extra
safe formulas. Replacing the multiple extra safe rules in step 1 by a single rule via a disjunction
preserves the extra safe property. Since the compiled definition of each $p_j$ has exactly the
same variables as $p_j$, the substitution of step 2 also preserves the extra safe property. Finally,
in step 3 using De Morgan's rules and converting to disjunctive normal form preserves the
extra safe property as well. □

We now are able to handle stratified databases with extra safe rules as given by the following 
theorem. 

Theorem 7.2. Let DB be a stratified database where each IDB rule is extra safe. Compile the
IDB predicates using the Compilation Algorithm to $IDB^C$. Apply $IDB^C$ to the ResDB
rules to rewrite them in terms of compiled IDB predicates as $Res_{DB}^C$. Rename every negated
atom ($\mathit{not}\ p$) in the query and in $Res_{DB}^C$ to a new predicate ($p'$) in a consistent manner and
add the integrity constraints $\leftarrow p, p'$. Then, the results of Theorem 6.1 and Theorem 6.2,
that deal with the leaves of the proof tree, and the completeness test algorithm apply.

Proof. By Theorem 7.1 the Compilation Algorithm preserves extra safety for the predicates of
$IDB^C$. Hence, the variables in ResDB and $Res_{DB}^C$ are the same. By renaming $\mathit{not}\ p$ to
$p'$ we eliminate negation and the assumptions of Section 6 (see page 26) are satisfied. The
result follows because now Theorem 6.1 and Theorem 6.2 apply, and the additional integrity
constraints assure that it is not possible to have both $p$ and $\mathit{not}\ p$ at the same time. □

This theorem allows us to obtain proof trees and test for completeness. In actually computing
a folded query, we need to change back the primed atoms $p'$ to $\mathit{not}\ p$. The following example
illustrates the theorem. (We omit the transformation of $\mathit{not}\ p$ to $p'$.)

Example 8. Let the EDB consist of: $e_{01}(X, Y)$, $e_{02}(X, Y, Z)$, $e_{03}(X, Y)$, and $e_{04}(X, Y)$. Let the
IDB consist of the following clauses:

$e_{11}(X, Y, Z) \leftarrow e_{01}(X, Y),\ e_{02}(X, Y, Z)$ (92)

$e_{11}(X, Y, Z) \leftarrow e_{02}(X, Y, Z),\ \mathit{not}\ e_{03}(X, Y)$ (93)

$e_{12}(X, Y) \leftarrow e_{04}(X, Y),\ \mathit{not}\ e_{01}(X, Y)$ (94)

$e_{21}(X, Y) \leftarrow e_{03}(X, Y),\ \mathit{not}\ e_{12}(X, Y)$ (95)

Now, let ResDB consist of:

$r(X, Y, Z) \leftarrow e_{11}(X, Y, Z),\ \mathit{not}\ e_{12}(X, Y)$ (96)

The above database is not compiled. Using the algorithm given above, the following database
is found, where all rules are written in terms of EDB predicates. The compiled definitions,
$IDB^C$, become:

$e_{11}(X, Y, Z) \leftarrow e_{02}(X, Y, Z),\ (e_{01}(X, Y) \vee \mathit{not}\ e_{03}(X, Y))$ (97)

$e_{21}(X, Y) \leftarrow e_{03}(X, Y),\ (\mathit{not}\ e_{04}(X, Y) \vee e_{01}(X, Y))$ (98)

The compiled resource, $Res_{DB}^C$, becomes:

$r(X, Y, Z) \leftarrow e_{02}(X, Y, Z),\ (e_{01}(X, Y) \vee \mathit{not}\ e_{03}(X, Y)),\ (e_{01}(X, Y) \vee \mathit{not}\ e_{04}(X, Y))$ (99)

The Clark Completion resource rules become:

CCrr1: $e_{02}(X, Y, Z) \leftarrow r(X, Y, Z)$ (100)

CCrr2: $e_{01}(X, Y) \vee \mathit{not}\ e_{03}(X, Y) \leftarrow r(X, Y, Z)$ (101)

CCrr3: $e_{01}(X, Y) \vee \mathit{not}\ e_{04}(X, Y) \leftarrow r(X, Y, Z)$ (102)

Let the negation of the query be: $Q : \leftarrow e_{21}(X, Y)$

In Figure 11, we show that using the query and the Clark Completion resource rules yields a
partial folding on the leftmost branch:

$\leftarrow r(X, Y, Z),\ e_{03}(X, Y)$ (103)

We now consider the second case where the rules are safe but not necessarily extra safe. In this case
we modify the Compilation Algorithm in step 2 so that predicates with a definition that is not extra
safe are not compiled. The end result of the compilation may then contain IDB predicates
in some definitions. We use the same notation for the compiled versions, i.e. $IDB^C$ and $Res_{DB}^C$,
as before. Theorem 7.2 extends to this case as follows.

Theorem 7.3. Let DB be a stratified database. Compile the IDB predicates using the
Compilation Algorithm modified as explained above to $IDB^C$. Proceed as in the statement of
Theorem 7.2, compiling ResDB to $Res_{DB}^C$ and renaming the negated predicates. Then, the
results of Theorem 5.1 and Theorem 6.2 apply.

Proof. Similar to the proof of Theorem 7.2 except that the modified Compilation Algorithm is 
used and so the compilation process stops earlier for certain predicates. □ 

The following example illustrates the Theorem. (Again we omit the transformation of not p to p'.) 

Example 9. Let the EDB consist of: $h(X, Y)$, $k(X, Y)$, $s(X, Y)$, and $\ell(X, Y, Z)$. Let the IDB
consist of the following clauses:

$p(X, Y) \leftarrow h(X, Z),\ k(Z, Y)$ (104)

$t(X, Y) \leftarrow s(X, Y),\ \mathit{not}\ p(X, Y)$ (105)

Now, let ResDB consist of:

$r(X, Y, Z) \leftarrow t(X, Y),\ \ell(Y, Z, U)$ (106)

Note that the definition for $p$ is not extra safe. So in the definition of $t(X, Y)$, $\mathit{not}\ p(X, Y)$ is
not changed. Thus, $IDB^C = IDB$ and the compiled resource, $Res_{DB}^C$, becomes:

$r(X, Y, Z) \leftarrow s(X, Y),\ \mathit{not}\ p(X, Y),\ \ell(Y, Z, U)$ (107)

The Clark Completion resource rules become:

CCrr1: $s(X, Y) \leftarrow r(X, Y, Z)$ (108)

CCrr2: $\mathit{not}\ p(X, Y) \leftarrow r(X, Y, Z)$ (109)

CCrr3: $\ell(Y, Z, f) \leftarrow r(X, Y, Z)$ (110)

Let the negation of the query be: $Q : \leftarrow s(X, Y),\ k(Y, Z)$

In Figure 12 we show that using the query and the Clark Completion resource rules yields a partial
folding:

$\leftarrow r(X, Y, Z),\ k(Y, Z)$ (111)

[Figure 11: Stratification Example. Proof tree for the query $\leftarrow e_{21}(X, Y)$ of Example 8, which
is first resolved with the compiled definition of $e_{21}$ in $IDB^C$.]

[Figure 12: Safe Negation Example. The query $\leftarrow s(X, Y),\ k(Y, Z)$ is resolved with
CCrr1: $s(X, Y) \leftarrow r(X, Y, Z)$, yielding $\leftarrow r(X, Y, Z),\ k(Y, Z)$.]

We end this section by showing that any folded query resulting from Theorems 6.2 or 6.3 is safe.
We start by proving a general result. 

Proposition 7.1. The resolution of two safe formulas (in Datalog with negation and disjunction) 
is a safe formula. 

Proof. Let the two safe formulas have the form:
$A_1, \ldots, A_k \leftarrow A_{k+1}, \ldots, A_m$ and
$B_1, \ldots, B_\ell \leftarrow B_{\ell+1}, \ldots, B_n$,

where each $A_i$, $B_j$, $1 \leq i \leq m$, $1 \leq j \leq n$, is an atom. We may assume without loss of
generality that $A_1$ is resolved with $B_n$ to yield

$A_2, \ldots, A_k, B_1, \ldots, B_\ell \leftarrow A_{k+1}, \ldots, A_m, B_{\ell+1}, \ldots, B_{n-1}$.

Actually, the resolution involves a substitution $\theta$, so here each $A_i$, $B_j$, $2 \leq i \leq m$,
$1 \leq j \leq n-1$, is really $A_i\theta$, $B_j\theta$, but this can be ignored for the purpose of this proof
because any changed variable in the head of a clause must be changed the same way in the body.

We need to show that every variable in the resulting formula is limited. If $X$ is a variable in
any $A_i$, $2 \leq i \leq m$, it must already be limited in $A_{k+1}, \ldots, A_m$ because the A formula is safe.
If $X$ is a variable in any $B_j$, $1 \leq j \leq n-1$, that did not appear in $B_n$, it must already be
limited in $B_{\ell+1}, \ldots, B_{n-1}$ because the B formula is safe. Finally, if $X$ is a variable in some
$B_j$, $1 \leq j \leq n-1$, that was limited in the B formula in $B_n$, by the resolution $X$ must now
be limited in $A_{k+1}, \ldots, A_m$. □

Corollary 1. Every folded query resulting from Theorem 6.3 is safe. 

Proof. The query is safe and so are the rules in IDB and IC. Consider the way the Clark
Completion resource rules are obtained. In each such rule the body contains only the resource
predicate and every additional variable in the body of an original resource rule becomes a 
function symbol. Since these function symbols cannot be iterated, they can be treated as 
constants. Hence the Clark Completion resource rules are also safe and the result follows 
from the Proposition. □ 



8 Recursion 



Query folding becomes problematic in the presence of recursion, since one does not know when to 
terminate recursion in a top-down approach. However, if it is known that the recursion is bounded 
[MN82, NS87], that is, the recursion is known to terminate after a number of stages using only 
intensional rules, then one can use the methods we describe in the previous sections to handle this 
case. In this section we assume that recursion is not bounded. 

Unlike the previous sections, the computations we present here follow the bottom-up approach 
rather than the top-down method we used earlier. The reason is that in the presence of recursion 
it is difficult to tell when all the solutions have been obtained in the infinite proof tree. In this 
case, we are not doing query folding, but we are doing query answering. That is, we do not find 
a rewriting of the query in terms of the resources. Actually, there are strong connections between 
the top-down and bottom-up approaches: Bry [Bry90] describes a combined top-down, bottom-up 
interpreter that incorporates the magic set technique used for recursion. 

We start with the case where the recursion occurs only in the query, so the IDB, IC, and ResDB
contain no recursion. This is equivalent to the case where recursion appears in the IDB, a query
is asked in terms of the IDB and the ResDB definitions include only EDB predicates (and IDB
predicates defined in such a way that they can be compiled to EDB predicates without recursion).

Within the case where recursion only occurs in the query we start with the subcase where everything 
is positive, that is, there is no negation. The database restrictions are as in Section 4.1 except that 
the query is recursive. This case was solved in [DGL00]. We start by considering their Example 3.1.



Example 10.
EDB: $edge(X, Y)$
IDB: (none)
IC: (none)
ResDB: $r(X, Y) \leftarrow edge(X, Z),\ edge(Z, Y)$
Query: $q(X, Y) \leftarrow edge(X, Y)$
$q(X, Y) \leftarrow edge(X, Z),\ q(Z, Y)$



In this example the recursive query determines the transitive closure of the relation edge, while the
resource predicate stores endpoints of paths of length 2. The Clark Completion resource rules here
are

CCrr1: $edge(X, f(X, Y)) \leftarrow r(X, Y)$.
CCrr2: $edge(f(X, Y), Y) \leftarrow r(X, Y)$.

In this type of situation the solution is a two-step process. In the first step the extensional
predicates, edge in this case, are evaluated from the resource predicates using the Clark Completion
resource rules, and in the second step a bottom-up Datalog evaluation is done for the recursive
query from the extensional predicates. It is shown in [DGL00] why this process always terminates in a
finite amount of time. Specifically, they note that the key observation is that function symbols are
only introduced in inverse rules. Because inverse rules are not recursive, no terms with nested
function symbols can be generated. As we mentioned earlier, the problem with the top-down approach
is that it is difficult to find out when to stop processing. We draw one branch of the top-down tree
in Figure 13.

[Figure 13: One branch of the recursive proof tree. Starting from the query $\leftarrow q(X, Y)$, the
branch unfolds the recursive rule for $q$, resolves the two edge atoms with CCrr1 and CCrr2, and
after factoring ends in the leaf $\leftarrow r(X, Y)$.]
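The two-step process for Example 10 can be sketched as follows. The resource tuples are made-up data, the Skolem term $f(X, Y)$ is encoded as a Python tuple, and discarding answers that contain a Skolem term at the end is an assumption of this sketch (it is the usual convention for inverse rules, but it is not spelled out above).

```python
def invert(r):
    """Step 1: apply CCrr1 and CCrr2 to every tuple r(x, y)."""
    edge = set()
    for x, y in r:
        mid = ("f", x, y)        # Skolem term f(x, y)
        edge.add((x, mid))       # CCrr1: edge(X, f(X, Y)) <- r(X, Y)
        edge.add((mid, y))       # CCrr2: edge(f(X, Y), Y) <- r(X, Y)
    return edge


def evaluate_q(edge):
    """Step 2: naive bottom-up fixpoint for the recursive query q."""
    q = set(edge)                                   # q(X, Y) <- edge(X, Y)
    while True:
        new = {(x, y) for (x, z) in edge
               for (z2, y) in q if z == z2} - q     # q <- edge(X, Z), q(Z, Y)
        if not new:
            return q
        q |= new


def is_skolem(term):
    return isinstance(term, tuple)


r = {("a", "c"), ("c", "e")}     # hypothetical endpoints of paths of length 2
answers = {(x, y) for (x, y) in evaluate_q(invert(r))
           if not is_skolem(x) and not is_skolem(y)}
print(sorted(answers))
```

The fixpoint terminates because step 1 creates only finitely many Skolem terms and step 2 never builds new ones, which mirrors the termination argument quoted above.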

The second subcase is where the recursive query is stratified, so negation is allowed. A similar 
two-step process works here also. The first step is the same as before. However, in the second step 
the query is computed by computing the stratified database [Ull89].



The final subcase we consider is where the resources are defined by multiple definitions (but still
without recursion) as in Section 6. Recall that now the Clark Completion resource rules will
contain disjunctions in the head. We illustrate the idea by reconsidering the simple example from
the beginning of Section 6 where one Clark completion resource rule (written as a single rule) is:

$(p_1(X) \wedge p_2(X)) \vee p_3(X) \leftarrow r(X)$ (112)

Suppose we have $r(a)$. Then we separately do three subcomputations:

• generate $p_1(a)$ and $p_2(a)$,

• generate $p_3(a)$,

• generate $p_1(a)$, $p_2(a)$, and $p_3(a)$.

In Step 1 apply this process to all the Clark Completion resource rules to obtain all the
subcomputations. For every subcomputation apply Step 2, the bottom-up Datalog evaluation of the query.
Place a tuple in the answer only if it is an answer in every subcomputation. This is essentially the
minimal models approach for disjunctive logic programming [LMR92].
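The enumeration of subcomputations for rule (112) can be sketched as follows. The names `choices`, `subcomputations` and `certain_answers`, the encoding of facts as `(predicate, value)` pairs, and the two sample queries are assumptions made for illustration; the queries are non-recursive, so Step 2 is reduced to a simple filter.

```python
from itertools import combinations, product

# Disjuncts in the head of rule (112): (p1 and p2) or p3.
DISJUNCTS = [["p1", "p2"], ["p3"]]


def choices(disjuncts):
    """All non-empty sets of disjuncts that may hold for one resource tuple."""
    idx = range(len(disjuncts))
    return [c for n in range(1, len(disjuncts) + 1)
            for c in combinations(idx, n)]


def subcomputations(r_tuples, disjuncts):
    """Yield one set of EDB facts per way of satisfying the rule everywhere."""
    for picks in product(choices(disjuncts), repeat=len(r_tuples)):
        facts = set()
        for value, pick in zip(r_tuples, picks):
            for i in pick:
                facts |= {(pred, value) for pred in disjuncts[i]}
        yield facts


def certain_answers(query, r_tuples, disjuncts):
    """Keep a tuple only if it is an answer in every subcomputation."""
    results = [query(facts) for facts in subcomputations(r_tuples, disjuncts)]
    return set.intersection(*results)


def q_either(facts):      # q(X) <- p1(X)   and   q(X) <- p3(X)
    return {v for (p, v) in facts if p in ("p1", "p3")}


def q_p1(facts):          # q(X) <- p1(X)
    return {v for (p, v) in facts if p == "p1"}


print(certain_answers(q_either, ["a"], DISJUNCTS))
print(certain_answers(q_p1, ["a"], DISJUNCTS))
```

With the single resource fact $r(a)$ there are exactly the three subcomputations listed above; `a` is an answer to the first query in all of them, but not to the second, because the subcomputation that generates only $p_3(a)$ does not contain $p_1(a)$.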

We end our exploration of recursion by considering the case where recursion is allowed in the
ResDB. In general, there is no known method to handle this. In the following, we discuss a
specific form of recursion for which we obtain sound answers. The problem is how to get the Clark
Completion resource rules in a usable form to give us definitions of extensional predicates in terms of
resource predicates. Going back to the previous edge example, suppose that instead of the resource
predicate storing endpoints of paths of length 2, the resource predicate stores the endpoints of all
paths. This would be written as

ResDB: $r(X, Y) \leftarrow edge(X, Y)$
$r(X, Y) \leftarrow edge(X, Z),\ r(Z, Y)$

Now we write a modified Clark Completion resource rule as:

MCCrr: $edge(X, Y) \leftarrow r(X, Y),\ \mathit{not}\ \exists Z\,(r(X, Z),\ r(Z, Y))$ (113)

obtaining all those paths in $r$ that cannot be broken up into two paths. The two steps of the
computation process are as before: evaluate edge first from $r$ (using the modified Clark Completion
resource rule) and then evaluate $q$ from edge whether or not $q$ is recursive. The general result is as
follows:

Proposition 8.1. Suppose that a ResDB predicate $r$ is recursive and has the form

$r(X) \leftarrow e(X)$
$r(X) \leftarrow t(Y)$

where $t$ is a conjunction that may include $e$ and $r$, and $X \subseteq Y$.
Then the modified Clark Completion resource rule is:

MCCrr: $e(X) \leftarrow r(X),\ \mathit{not}\ \exists Z\, t(Y)$
where $Z = Y - X$.

MCCrr can be used to evaluate $e$ from $r$ in a sound manner.

Proof. We proceed by contraposition. Suppose that for some tuple $a$, $e(a)$ is false and $r(a)$ is true.
This means that $r(a)$ must have been obtained by the second (recursive) definition. But then
$\exists Z\, t(Y)$ must be true, so MCCrr does not derive $e(a)$. □
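For the edge example, rule (113) amounts to keeping those pairs of $r$ that cannot be split into two $r$ pairs. The sketch below uses made-up edge relations to show that the recovered edges are sound but need not be complete; the function names are assumptions.

```python
def transitive_closure(edges):
    """The resource r: endpoints of all paths over `edges`."""
    closure = set(edges)
    while True:
        new = {(x, y) for (x, z) in edges
               for (z2, y) in closure if z == z2} - closure
        if not new:
            return closure
        closure |= new


def recover_edges(r):
    """MCCrr (113): edge(X, Y) <- r(X, Y), not exists Z (r(X, Z), r(Z, Y))."""
    mids = {z for (_, z) in r}
    return {(x, y) for (x, y) in r
            if not any((x, z) in r and (z, y) in r for z in mids)}


E1 = {("a", "b"), ("b", "c")}
E2 = E1 | {("a", "c")}            # same paths, plus a direct edge a -> c

recovered1 = recover_edges(transitive_closure(E1))
recovered2 = recover_edges(transitive_closure(E2))
print(sorted(recovered1), sorted(recovered2))
```

Both databases give the same resource relation, so the direct edge from `a` to `c` in `E2` cannot be recovered: every edge returned is a true edge, but some true edges may be missing, which is why only soundness is claimed.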






9 Comparison 



In this section we discuss the contributions made in this paper and compare the work with other 
efforts. 

Qian ([Qia96]) was the first to consider the problem of folding. In her paper she introduced the 
concept of inverse rules to permit one to use resources to compute answers to queries. Inverse 
rules basically state that the only way in which the information in the resource can be computed 
is through the resource. As noted in this paper, the concept of inverse rules was introduced first in 
the context of logic programming by Clark [Qia96], and is the basis of the closed world assumption. 
It represents an if-and-only-if condition for rules. Qian showed that if, for each resource predicate
defined by a conjunctive rule, there is at most one such rule, then it is possible to compute the
answer to queries using resources in many instances. We have extended Qian's result slightly to
include databases that contain arbitrary integrity constraints. Duschka and his associates [Dus97a, 
DG97a, Dus97b] show how to extend the work to classes of integrity constraints. [DG97a, Dus97b] 
were the first to extend the work to handle general recursive queries. 

Dawson, Gryz and Qian [DGQ96] show how to compute answers in query folding when there are 
functional dependencies. [DL97, Dus97b] introduced the new class of recursive query plans for
information gathering. Instead of plans being only sets of conjunctive queries, they can now be recursive
sets of function-free Horn clauses. Using recursive plans, they settle two open problems. First, they 
describe an algorithm for finding the maximally contained rewriting in the presence of functional 
dependencies. Second, they describe an algorithm for finding the maximally-contained rewriting in 
the presence of binding-pattern restrictions, which was not possible without recursive plans. We 
have shown how integrity constraints containing equality, such as functional dependencies, generate 
a possibly infinite set of folded queries. 

Duschka and Genesereth [DG98, Dus97b] developed the first algorithm to solve the problem of
answering queries using views when view definitions are allowed to contain disjunction. They use
the Clark completion to obtain the inverse rules. Disjunctive definitions for rules imply that
Datalog is not sufficiently powerful to handle such situations and it is necessary to use Datalog$^\vee$.
Their focus is on maximal query containment. They show a duality between a query plan being
maximally contained in a query and this plan computing exactly the certain answers. They show
that the disjunctive plan can be evaluated in co-NP time. Afrati et al. [AGK98] also treat the
problem of disjunctive materialized views. The relationship with our approach is that we show
how, using theorem proving concepts, we can handle such theories, including integrity constraints.
We also show how one can determine if the query plan that has been developed is complete, that
is, if all answers to the original query have been obtained. We do this in the context of theorem
proving.

We know of no work that covers negation in the folding problem. We show how to handle stratified 
negation in views which may be defined by disjunctive rules, and may contain integrity constraints. 
Duschka [Dus97b] discusses complexity results with respect to negation in his thesis. 

With respect to recursion, as noted above, [DG97a] handle recursion. We show how to handle 
recursion in a similar manner to their work, and extend the results to stratified databases. In 
general, there is no known method to handle recursion in resource rules. We suggest one limited 
case where recursion appears in resource rules and sound answers may be found. 
10 Summary



We have shown that a logic-based approach using resolution unifies techniques used earlier for the 
aspect of data integration also known as query folding. We considered a deductive database where
a query written on the database needs to be rewritten in terms of given resources. We showed 
how to handle integrity constraints and the case where a resource has multiple definitions. We also 
showed when the folded query yields all or only some of the answers of the original query. We 
extended our results to some cases involving negation and recursion. 